EXPRESSION PATTERNS

(57) Abstract 

The present invention provides methods for enhanced detection of biological response patterns. In one embodiment of the invention,
genes are grouped into basis genesets according to the co-regulation of their expression. Expression of individual genes within a geneset
is indicated with a single gene expression value for the geneset by a projection process. The expression values of genesets, rather than
the expression of individual genes, are then used as the basis for comparison and detection of biological response with greatly enhanced
sensitivity. In another embodiment of the invention, biological responses are grouped according to the similarity of their biological profile.
The methods of the invention have many useful applications, particularly in the fields of drug development and discovery. For example,
the methods of the invention may be used to compare biological responses with greatly enhanced sensitivity. The biological responses
that may be compared according to these methods include responses to single perturbations, such as a biological response to a mutation or
temperature change, as well as graded perturbations such as titration with a particular drug. The methods are also useful to identify cellular
constituents, particularly genes, associated with a particular type of biological response. Further, the methods may also be used to identify
perturbations, such as novel drugs or mutations, which affect one or more particular genesets. The methods may still further be used to
remove experimental artifacts in biological response data.
METHODS FOR USING CO-REGULATED GENESETS TO ENHANCE
DETECTION AND CLASSIFICATION OF GENE EXPRESSION PATTERNS

This is a continuation-in-part of application serial no. 09/220,275, filed on
December 23, 1998, which is a continuation-in-part of application serial no. 09/179,569,
filed October 27, 1998, each of which is incorporated herein, by reference, in its entirety.

1. FIELD OF THE INVENTION
The field of this invention relates to methods for enhanced detection of biological
responses to perturbations. In particular, it relates to methods for analyzing structure in
biological expression patterns for the purposes of improving the ability to detect certain
specific gene regulations and to classify more accurately the actions of compounds that
produce complex patterns of gene regulation in the cell.

2. BACKGROUND OF THE INVENTION

Within the past decade, several technologies have made it possible to monitor the
expression level of a large number of transcripts at any one time (see, e.g., Schena et al.,
1995, Quantitative monitoring of gene expression patterns with a complementary DNA
microarray, Science 270:467-470; Lockhart et al., 1996, Expression monitoring by
hybridization to high-density oligonucleotide arrays, Nature Biotechnology 14:1675-1680;
Blanchard et al., 1996, Sequence to array: Probing the genome's secrets, Nature
Biotechnology 14:1649; U.S. Patent 5,569,588, issued October 29, 1996 to Ashby et al.,
entitled "Methods for Drug Screening"). In organisms for which the complete genome is
known, it is possible to analyze the transcripts of all genes within the cell. With other
organisms, such as human, for which there is an increasing knowledge of the genome, it is
possible to simultaneously monitor large numbers of the genes within the cell.

Such monitoring technologies have been applied to the identification of genes which
are up regulated or down regulated in various diseased or physiological states, the analyses
of members of signaling cellular states, and the identification of targets for various drugs.
See, e.g., Friend and Hartwell, U.S. Provisional Patent Application Serial No. 60/039,134,
filed on February 28, 1997; Stoughton, U.S. Patent Application Serial No. 09/099,722, filed
on June 19, 1998; Stoughton and Friend, U.S. Patent Application Serial No. 09/074,983,
filed on May 8, 1998; Friend and Hartwell, U.S. Provisional Application Serial No.
60/056,109, filed on August 20, 1997; Friend and Hartwell, U.S. Application Serial No.
09/031,216, filed on February 26, 1998; Friend and Stoughton, U.S. Provisional
Application Serial Nos. 60/084,742 (filed on May 8, 1998), 60/090,004 (filed on June 19,
1998) and 60/090,046 (filed on June 19, 1998), all incorporated herein by reference for all
purposes.

Levels of various constituents of a cell are known to change in response to drug
treatments and other perturbations of the cell's biological state. Measurements of a plurality
of such "cellular constituents" therefore contain a wealth of information about the effect of
perturbations and their effect on the cell's biological state. Such measurements typically
comprise measurements of gene expression levels of the type discussed above, but may also
include levels of other cellular components such as, but by no means limited to, levels of
protein abundances, or protein activity levels. The collection of such measurements is
generally referred to as the "profile" of the cell's biological state.

The number of cellular constituents is typically on the order of a hundred thousand
for mammalian cells. The profile of a particular cell is therefore typically of high
complexity. Any one perturbing agent may cause a small or a large number of cellular
constituents to change their abundances or activity levels. Not knowing what to expect in
response to any given perturbation will therefore require measuring independently the
responses of these about 10^5 constituents if the action of the perturbation is to be completely
or at least mostly characterized. The complexity of the biological response data coupled
with measurement errors makes such an analysis of biological response data a challenging
task.

Current techniques for quantifying profile changes suffer from high rates of
measurement error such as false detection, failures to detect, or inaccurate quantitative
determinations. Therefore, there is a great demand in the art for methods to enhance the
detection of structure in biological expression patterns. In particular, there is a need to find
groups and structure in sets of measurements of cellular constituents, e.g., in the profile of a
cell's biological state. Examples of such structure include associations between the
regulation of the expression levels of different genes, associations between different drugs or
drug candidates, and association between the drugs and the regulation of sets of genes.

Discussion or citation of a reference herein shall not be construed as an admission
that such reference is prior art to the present invention.

3. SUMMARY OF THE INVENTION 
This invention provides methods for enhancing detection of structures in the
response of biological systems to various perturbations, such as the response to a drug, a
drug candidate or an experimental condition designed to probe biological pathways as well
as changes in biological systems that correspond to a particular disease or disease state, or



WO 00/24936 PCT/US99/25025

to a treatment of a particular disease or disease state. The methods of this invention have
extensive applications in the areas of drug discovery, drug therapy monitoring, genetic
analysis, and clinical diagnosis. This invention also provides apparatus and computer
instructions for performing the enhanced detection of biological response patterns, drug
discovery, monitoring of drug therapies, genetic analysis, and clinical diagnosis.

One aspect of the invention provides methods for classifying cellular constituents
(measurable biological variables, such as gene transcripts and protein activities) into groups
based upon the co-variation among those cellular constituents. Each of the groups
has cellular constituents that co-vary in response to perturbations. Those groups are termed
cellular constituent sets.

In some specific embodiments, genes are grouped according to the degree of
co-variation of their transcription, presumably co-regulation. Groups of genes that have
co-varying transcripts are termed genesets. Cluster analysis or other statistical classification
methods are used to analyze the co-variation of transcription of genes in response to a
variety of perturbations. In preferred embodiments, the cluster analysis or other statistical
classification methods use a novel "distance" or "similarity" metric to evaluate the
similarity (i.e., the co-variance) of two or more genes (or other cellular constituents) in
response to the variety of perturbations. In one specific embodiment, clustering algorithms
are applied to expression profiles (e.g., a collection of transcription rates of a number of
genes) obtained under a variety of cellular perturbations to construct a "similarity tree" or
"clustering tree" which relates cellular constituents by the amount of co-regulation
exhibited. Genesets are defined on the branches of a clustering tree by cutting across the
clustering tree at different levels in the branching hierarchy. In some embodiments, the
cutting level is chosen based upon the number of distinct response pathways expected for
the genes measured. In some other embodiments, the tree is divided into as many branches
as are truly distinct in terms of minimal distance value between the individual
branches.
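
The tree-building and tree-cutting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the inventors' implementation: the distance is taken to be 1 minus the Pearson correlation (a stand-in for the novel metric, which is not specified in this passage), merging uses average linkage, and the hypothetical parameter `cut_distance` plays the role of the cutting level. The function names are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length response vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat response carries no co-variation information
    return sxy / math.sqrt(sxx * syy)

def cluster_genes(responses, cut_distance):
    """Group genes by co-variation across perturbations.

    responses: dict mapping gene name -> list of responses, one per perturbation.
    cut_distance: assumed cutting level; merging stops once the closest pair of
    clusters is farther apart than this, which corresponds to cutting an
    average-linkage tree at that height.
    """
    clusters = [[g] for g in sorted(responses)]

    def dist(a, b):
        # Average linkage: mean pairwise (1 - correlation) between two clusters.
        total = sum(1.0 - pearson(responses[g], responses[h]) for g in a for h in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > cut_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)
```

A lower `cut_distance` yields more, smaller genesets; a higher value merges branches into fewer, larger ones.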

In some preferred embodiments, objective statistical tests are employed to define
truly distinct branches. One exemplary embodiment of such a statistical test employs
Monte Carlo randomization of the perturbation index for each gene's responses across all
perturbations tested. In some preferred embodiments, the cutoff level is set so that
branching is significant at the 95% confidence level. In preferred embodiments, clusters
with one or two genes are discarded. In some other embodiments, however, small clusters
with one or two genes are included in genesets. In more detail, the preferred statistical tests
of the invention comprise (a) obtaining a measure of the "compactness" of clusters (i.e.,
cellular constituent sets such as gene sets) determined by the above mentioned cluster
analysis or other statistical techniques, and (b) comparing the thus obtained measure of
compactness to a hypothetical measure of compactness of cellular constituents regrouped in
an increased number of clusters. Such a comparison typically comprises determining the
difference in the compactness of the two sets of clusters. Further, by employing Monte
Carlo randomization of the perturbation index for each gene's responses across all
perturbations tested, a statistical distribution of the difference in the compactness is thus
generated. The statistical significance of the actual difference in compactness can then be
determined by comparing this actual difference in compactness to the statistical distribution
of the differences in compactness from the Monte Carlo randomizations.
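
A rough sketch of such a Monte Carlo test is given below. It rests on several assumptions that are not taken from the text: compactness is measured as the mean squared distance of members to their cluster centroid, the "increased number of clusters" is obtained by a crude two-way split around the two most distant members, and the p-value uses an add-one estimator. All function names and the `trials` and `seed` parameters are hypothetical.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def compactness(vectors):
    """Mean squared distance of the vectors to their centroid (smaller is tighter)."""
    n = len(vectors)
    if n == 0:
        return 0.0
    centroid = [sum(col) / n for col in zip(*vectors)]
    return sum(sq_dist(v, centroid) for v in vectors) / n

def split_in_two(vectors):
    """Split around the two most distant members (a stand-in for a tree cut)."""
    pairs = [(sq_dist(a, b), i, j)
             for i, a in enumerate(vectors)
             for j, b in enumerate(vectors) if i < j]
    _, i, j = max(pairs)
    left, right = [], []
    for v in vectors:
        nearer_left = sq_dist(v, vectors[i]) <= sq_dist(v, vectors[j])
        (left if nearer_left else right).append(v)
    return left, right

def improvement(vectors):
    """Difference in compactness between one cluster and its two-way split."""
    left, right = split_in_two(vectors)
    n = len(vectors)
    split = (len(left) * compactness(left) + len(right) * compactness(right)) / n
    return compactness(vectors) - split

def split_p_value(vectors, trials=200, seed=0):
    """Compare the actual improvement with improvements on randomized data.

    Each gene's responses are permuted independently across the perturbation
    index, which destroys co-variation while keeping each gene's own values.
    """
    rng = random.Random(seed)
    actual = improvement(vectors)
    hits = 0
    for _ in range(trials):
        shuffled = [rng.sample(v, len(v)) for v in vectors]
        if improvement(shuffled) >= actual:
            hits += 1
    return actual, (hits + 1) / (trials + 1)
```

Under these assumptions, a subdivision would be accepted at the 95% confidence level when the returned p-value is below 0.05.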

As the diversity of perturbations in the clustering set becomes very large, the
genesets which are clearly distinguishable get smaller and more numerous. However, it is a
discovery of the inventors that even over very large experiment sets, there are a number of
genesets that retain their coherence. These genesets are termed irreducible genesets. In
some embodiments of the invention, a large number of diverse perturbations are applied to
obtain these irreducible genesets.

Statistically derived genesets may be refined using regulatory sequence information
to confirm members that are co-regulated, or to identify more tightly co-regulated
subgroups. In such embodiments, genesets may be defined by their response pattern to
individual biological experimental perturbations such as specific mutations, or specific
growth conditions, or specific compounds. The statistically derived genesets may be further
refined based upon biological understanding of gene regulation. In another preferred
embodiment, classification of genes into genesets is based first upon the known regulatory
mechanisms of genes. Sequence homology of regulatory regions is used to define the
genesets. In some embodiments, genes with common promoter sequences are grouped into
one geneset.
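
Grouping by common promoter sequences might look like the following sketch. It assumes that each gene has already been annotated with identifiers of the promoter motifs found in its regulatory region; the annotation format, the motif names, and the choice to drop motifs that occur in only one gene are all assumptions made for illustration.

```python
from collections import defaultdict

def genesets_by_promoter(promoter_motifs):
    """Group genes that share a promoter motif.

    promoter_motifs: dict mapping gene name -> set of motif identifiers
    (hypothetical annotations, e.g. produced by a separate sequence search).
    Returns a dict mapping motif -> sorted list of genes carrying it.
    """
    groups = defaultdict(list)
    for gene in sorted(promoter_motifs):
        for motif in promoter_motifs[gene]:
            groups[motif].append(gene)
    # Assumption: a motif found in a single gene does not define a geneset.
    return {motif: genes for motif, genes in groups.items() if len(genes) > 1}
```

A gene carrying several motifs appears in several candidate genesets, which can then be compared against the statistically derived ones.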

In preferred embodiments, the cluster analysis and statistical classification methods
of this invention analyze co-variation, e.g., of transcription levels of individual genes, by
means of an objective, quantitative "similarity" or "distance" function which provides a
useful measurement of the similarity of expression levels for two or more cellular
constituents (e.g., for two or more genes). Accordingly, the present invention provides
novel similarity or distance functions which are particularly useful for analyzing the
co-variation of cellular constituents, including the co-variation of gene transcript levels. The
invention also provides objective statistical tests, in particular Monte Carlo procedures, for
assessing the significance of the cellular constituent sets or genesets obtained by the
methods of this invention. Finally, the clustering methods of this invention are equally
applicable to the clustering of both cellular constituents and biological profiles according to
their similarities. Thus, in another aspect, the present invention provides methods for
simultaneous clustering in both dimensions of a tabular data set. In preferred embodiments,
the data set is a table of numbers representing the levels, or changes in level, of a plurality
of cellular constituents in response to different conditions, perturbations, or condition
pairs.

Another aspect of the invention provides methods for expressing the state (or
biological responses) of a biological sample on the basis of co-varying cellular constituent
sets. In some embodiments, a profile containing a plurality of measurements of cellular
constituents in a biological sample is converted into a projected profile containing a
plurality of cellular constituent set values according to a definition of co-varying basis
cellular constituent sets. In some preferred embodiments, the cellular constituent set values
are the average of the cellular constituent values within a cellular constituent set. In some
other embodiments, the cellular constituent set values are derived from a linear projection
process. The projection operation expresses the profile on a smaller and biologically more
meaningful set of coordinates, reducing the effects of measurement errors by averaging
them over each cellular constituent set, and aiding biological interpretation of the profile.

The method of the invention is particularly useful for the analysis of gene expression
profiles. In some embodiments, a gene expression profile, such as a collection of
transcription rates of a number of genes, is converted to a projected gene expression profile.
The projected gene expression profile is a collection of geneset expression values. The
conversion is achieved, in some embodiments, by averaging the transcription rate of the
genes within each geneset. In some other embodiments, other linear projection processes
may be used.
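
The averaging form of the projection can be sketched as below. The data layout (a dict of per-gene values and a dict of geneset membership lists) and the function name are assumptions for illustration; averaging is simply the linear projection in which every member of a geneset has weight 1/n.

```python
def project_profile(profile, genesets):
    """Convert a per-gene profile into a per-geneset (projected) profile.

    profile: dict mapping gene name -> measured value (e.g. transcription rate).
    genesets: dict mapping geneset name -> list of member genes.
    Each geneset value is the plain average of its members' values.
    """
    projected = {}
    for name, genes in genesets.items():
        values = [profile[g] for g in genes]
        projected[name] = sum(values) / len(values)
    return projected
```

Because independent measurement errors partly cancel within each average, the projected values are less noisy than the individual gene values they summarize.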

In yet another aspect of the invention, methods for comparing cellular constituent set
values, particularly, geneset expression values are provided. In some embodiments, the
expression of at least 10, preferably more than 100, more preferably more than 1,000 genes
of a biological system is monitored. A known drug is applied to the system to generate a
known drug response profile in terms of genesets. A drug candidate is also applied to the
biological system to obtain a drug candidate response profile in terms of genesets. The drug
candidate's response profile is then compared with the known drug response profile to
determine whether the drug candidate induces a response similar to the response to a known
drug.

In some other embodiments, the comparison of projected profiles is achieved by
using an objective measure of similarity. In some preferred embodiments, the objective
measure is the generalized angle between the vectors representing the projections of the two
profiles being compared (the 'normalized dot product'). In some other embodiments, the
projected profiles are analyzed by applying a threshold to the amplitude associated with each
geneset for the projected profile. If the change of a geneset is above a threshold, it is
declared that a change is present in the geneset.
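
Both comparison approaches can be illustrated with a short sketch. The normalized dot product below is the cosine of the angle between two projected profiles. For the threshold test, comparing the absolute amplitude with the threshold is an assumption, since the text does not say how negative changes are handled; the function names are illustrative.

```python
import math

def normalized_dot(p, q):
    """Cosine of the angle between two projected profiles (lists of equal length)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def changed_genesets(projected, threshold):
    """Names of genesets whose amplitude exceeds the threshold in magnitude.

    projected: dict mapping geneset name -> projected amplitude.
    Assumption: up- and down-regulation are both counted as a change.
    """
    return [name for name, amplitude in projected.items() if abs(amplitude) > threshold]
```

A value of `normalized_dot` near 1 indicates that a drug candidate's projected profile points in nearly the same direction as that of the known drug, regardless of overall response strength.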

The methods of the present invention may also be used to group biological response
profiles according to the similarity of the responses of measured cellular constituents.
Accordingly, in alternative embodiments, the present invention provides methods for
grouping biological responses (i.e., response profiles) according to the degree of similarity
of the cellular constituents' responses by means of the cluster analysis or other statistical
classification methods described supra for classification of cellular constituents (e.g., genes)
into co-varying sets (e.g., genesets). Such methods may also be used, e.g., for enhancing
detection of structures in the responses of biological systems to various perturbations. Still
further, the present invention also provides "two-dimensional" methods of analyzing
biological response profile data. Such methods simply comprise (1) grouping cellular
constituents (e.g., genes) according to their degree of co-variation in the response profile
data, and (2) grouping response profiles according to the similarity of their cellular
constituents' responses.
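
The two-dimensional idea amounts to running the same grouping procedure once on the columns (cellular constituents) and once on the rows (response profiles) of the data table. The sketch below uses a deliberately simple greedy grouping by Pearson correlation as a stand-in for the cluster analysis; the grouping rule, the `min_corr` parameter, and all names are assumptions for illustration.

```python
import math

def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx and syy else 0.0

def group(vectors, min_corr):
    """Greedy grouping: a vector joins the first group whose founding member
    it correlates with at or above min_corr, otherwise it founds a new group."""
    groups = []
    for label in sorted(vectors):
        for g in groups:
            if correlation(vectors[label], vectors[g[0]]) >= min_corr:
                g.append(label)
                break
        else:
            groups.append([label])
    return groups

def two_dimensional_groups(table, genes, experiments, min_corr):
    """table[i][j] is the response of genes[j] in experiments[i]."""
    by_gene = {g: [row[j] for row in table] for j, g in enumerate(genes)}
    by_experiment = {e: table[i] for i, e in enumerate(experiments)}
    return group(by_gene, min_corr), group(by_experiment, min_corr)
```

Reordering the columns and rows of the table according to the two returned groupings places similar genes and similar experiments next to each other.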

The clustering methods of the invention are particularly useful, e.g., for identifying
and/or characterizing perturbations (for example, drugs, drug candidates or genetic
mutations) affecting particular cellular constituents or particular groups of cellular
constituents. For example, the clustering methods can be used to identify cellular
constituents (e.g., genes and proteins) and/or sets of co-varying cellular constituents such as
genesets whose changes in expression or abundance are associated with a particular
biological effect such as a particular disease state or the effect of one or more drugs.
Further, the clustering methods of the invention are also useful, e.g., for identifying cellular
constituents, such as genes or gene transcripts, involved in a particular biological response
or pathway. Thus, the invention further provides methods for identifying cellular
constituents, such as genes or gene transcripts, associated with a particular biological
response or pathway by means of the cluster analysis methods described supra. The
invention still further provides methods for identifying biological "perturbations", for
example drugs, drug candidates, or genetic mutations which "perturb" a biological system,
affecting particular cellular constituents or particular groups of cellular constituents by
means of the cluster analysis methods described supra. The cellular constituents and
perturbations identified by the methods of the invention may be known or previously
unknown. Thus, the invention provides methods for identifying, e.g., novel genes and
drugs or drug candidates as well as previously known genes and drugs/drug candidates which
were not previously known to be associated with a particular biological effect of interest.




The methods of the present invention may also be used to remove one or more
artifacts from a measured biological profile (i.e., from a measured profile comprising a
plurality of measurements of cellular constituents). Thus, the invention provides methods
for removing such artifacts from a measured biological profile by subtracting one or more
artifact patterns from the measured biological profile, wherein each artifact pattern
corresponds to a particular artifact.
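
Subtracting an artifact pattern can be sketched as follows. The text states only that the pattern is subtracted; scaling the pattern by its least-squares coefficient in the measured profile before subtraction is an assumption made here, and the function name is illustrative.

```python
def remove_artifact(profile, template):
    """Subtract the best-fitting multiple of an artifact template from a profile.

    profile, template: lists of equal length, one entry per cellular constituent.
    The scale factor alpha minimizes the squared residual |profile - alpha * template|^2.
    """
    tt = sum(t * t for t in template)
    if tt == 0:
        return list(profile)  # an all-zero template removes nothing
    alpha = sum(p * t for p, t in zip(profile, template)) / tt
    return [p - alpha * t for p, t in zip(profile, template)]
```

Several artifact patterns could be removed by applying the function once per template, although templates that are not orthogonal to each other would call for a joint fit instead.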

The methods of the invention are preferably implemented with a computer system
capable of executing cluster analysis and projection operations. In some embodiments, a
computer system contains a computer-usable medium having computer readable program
code embodied. The computer code is used to effect retrieving a definition of basis
genesets from a database and converting a gene expression profile into a projected
expression profile according to the retrieved definition.
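
The retrieve-then-project flow can be illustrated with an in-memory SQLite database. The table name `geneset_members`, its two columns, and the demo rows are hypothetical; the text does not describe the actual database layout, so this is only one plausible arrangement.

```python
import sqlite3

def load_geneset_definition(conn):
    """Read the basis geneset definition as a dict: geneset name -> list of genes."""
    rows = conn.execute(
        "SELECT geneset, gene FROM geneset_members ORDER BY geneset, gene")
    definition = {}
    for geneset, gene in rows:
        definition.setdefault(geneset, []).append(gene)
    return definition

def project(profile, definition):
    """Average the profile values over the members of each geneset."""
    return {name: sum(profile[g] for g in genes) / len(genes)
            for name, genes in definition.items()}

def demo_database():
    """Build a small in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE geneset_members (geneset TEXT, gene TEXT)")
    conn.executemany(
        "INSERT INTO geneset_members VALUES (?, ?)",
        [("A", "G1"), ("A", "G2"), ("B", "G3")])
    return conn
```

Keeping the geneset definition in a database, separate from the projection code, lets the same code be reused as the definition is refined.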



4. BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 illustrates an embodiment of the cluster analysis. 



Fig. 2 illustrates the projection process. 

Fig. 3 illustrates an exemplary geneset database management system.

Fig. 4A illustrates two different possible responses to receptor activation.

Fig. 4B illustrates three main clusters of yeast genes with distinct temporal
behavior.

Fig. 5 illustrates a computer system useful for embodiments of the invention.

Fig. 6 shows a clustering tree derived from 'hclust' algorithm operating on a table of
18 experiments by 48 mRNA levels.

Fig. 7 shows a clustering tree derived from 34 experiments.

Fig. 8A-E shows amplitudes of the individual elements of the projected profile.

Fig. 9 shows results of correlating the profile of FK506 (16 ng/ml) treatment with
the profiles of each of the 34 experiments used to generate the basis genesets.

Fig. 10 illustrates an exemplary signaling cascade which includes a group of
up-regulated genes (G1, G2, and G3) and a group of down-regulated genes (G4, G5, and G6).

Fig. 11 is the clustering tree obtained by the hclust algorithm to identify clusters
(i.e., genesets) among 185 genes whose expression levels were measured in 34 perturbation
response profiles.

Fig. 12 illustrates an exemplary, two-dimensional embodiment of the Monte Carlo
method for assigning significance to cluster subdivisions.

Fig. 13 shows the transcriptional response of the largest responding genes of S.
cerevisiae to different concentrations of the drug FK506.

Fig. 14 shows projected titration curves obtained by projecting the titration curves of
Fig. 13.

Fig. 15 is chi-squared plotted around the values of the two Hill coefficients n and u0
derived for each geneset in Fig. 14.

Fig. 16A-D illustrates an exemplary application of the methods of the invention;
Fig. 16A is a grey scale display of 185 genetic transcripts of S. cerevisiae (horizontal axis)
measured in 34 different perturbation experiments (vertical axis); Fig. 16B shows the
co-regulation tree obtained by clustering the genetic transcripts of Fig. 16A using the 'hclust'
algorithm; Fig. 16C is an illustration of the same experimental data in which the transcripts
(horizontal axis) have been re-ordered according to the genesets defined from Fig. 16B;
Fig. 16D is another illustration of the experimental data in which the experimental index
(vertical axis) has also been reordered according to similarity of the response profiles.

Fig. 17 is another illustration of the data in Fig. 16 in which the genetic transcripts
(horizontal axis) and experiments (vertical axis) are ordered according to similarity;
individual genesets are identified above the false color image, while the biological pathways
and/or responses with which each geneset is associated are indicated below the image; the
label on the vertical axis summarizes each experiment.
Fig. 18 shows the correlation of the expression profiles of a (believed to be)
uncontaminated experiment measuring the effect of deletion of the gene YJL107c in S.
cerevisiae and an identical experiment unintentionally contaminated with an artifact (poor
control of RNA concentration during reverse transcription).

Fig. 19 shows a profile, plotted as gene expression ratio vs. mean expression level,
corresponding to poor control of RNA concentration in a reverse transcription procedure
during hybridization sample preparation.

Fig. 20 shows the correlation of the expression profile of a (believed to be)
uncontaminated experiment measuring the effect of deletion of the gene YJL107c in S.
cerevisiae and an identical experiment unintentionally contaminated with an artifact (poor
control of RNA concentration during reverse transcription) wherein the data from the
contaminated experiment has been "cleaned" using the response profile in Fig. 19 as a
"template" of the artifact.



5. DETAILED DESCRIPTION 
This section presents a detailed description of the invention and its applications.
This description is by way of several exemplary illustrations, in increasing detail and
specificity, of the general methods of this invention. These examples are non-limiting, and
related variants will be apparent to one of skill in the art.

Although, for simplicity, this disclosure often makes references to gene expression
profiles, transcriptional rate, transcript levels, etc., it will be understood by those skilled in
the art that the methods of the invention are useful for the analysis of any biological
response profile. In particular, one skilled in the art will recognize that the methods of the
present invention are equally applicable to biological profiles which comprise
measurements of other cellular constituents such as, but not limited to, measurements of
protein abundance or protein activity levels.



5.1. INTRODUCTION

The state of a cell or other biological sample is represented by cellular constituents
(any measurable biological variables) as defined in Section 5.1.1, infra. Those cellular
constituents vary in response to perturbations. A group of cellular constituents may co-vary
in response to particular perturbations. Accordingly, one aspect of the present invention
provides methods for grouping co-varying cellular constituents. Each group of co-varying
cellular constituents is termed a cellular constituent set. This invention is partially premised
upon a discovery of the inventors that the state of a biological sample can be more
advantageously represented using cellular constituent sets rather than individual cellular
constituents. It is also a discovery of the inventors that the response of a biological sample
can be better analyzed in terms of responses of co-varying cellular constituent sets rather
than cellular constituents.

In some preferred specific embodiments of this invention, genes are grouped into
basis genesets according to the regulation of their expression. Transcriptional rates of
individual genes within a geneset are combined to obtain a single gene expression value for
the geneset by a projection process. The expression values of genesets, rather than the
transcriptional rate of individual genes, are then used as the basis for the comparison and
detection of biological responses with greatly enhanced sensitivity.

This section first presents a background about representations of biological state and
biological responses in terms of cellular constituents. Next, a schematic and non-limiting
overview of the invention is presented, and the representation of biological states and
biological responses according to the method of this invention is introduced. The following
sections present specific non-limiting embodiments of this invention in greater detail.

5.1.1. DEFINITION OF BIOLOGICAL STATE 
As used herein, the term "biological sample" is broadly defined to include any
cell, tissue, organ or multicellular organism. A biological sample can be derived, for
example, from cell or tissue cultures in vitro. Alternatively, a biological sample can be
derived from a living organism or from a population of single cell organisms.

The state of a biological sample can be measured by the content, activities or
structures of its cellular constituents. The state of a biological sample, as used herein, is
taken from the state of a collection of cellular constituents, which are sufficient to
characterize the cell or organism for an intended purpose including, but not limited to
characterizing the effects of a drug or other perturbation. The term "cellular constituent" is
also broadly defined in this disclosure to encompass any kind of measurable biological
variable. The measurements and/or observations made on the state of these constituents can
be of their abundances (i.e., amounts or concentrations in a biological sample), or their
activities, or their states of modification (e.g., phosphorylation), or other measurements
relevant to the biology of a biological sample. In various embodiments, this invention
includes making such measurements and/or observations on different collections of cellular
constituents. These different collections of cellular constituents are also called herein
aspects of the biological state of a biological sample.
One aspect of the biological state of a biological sample (e.g., a cell or cell culture)
usefully measured in the present invention is its transcriptional state. In fact, the
transcriptional state is the currently preferred aspect of the biological state measured in this
invention. The transcriptional state of a biological sample includes the identities and
abundances of the constituent RNA species, especially mRNAs, in the cell under a given set
of conditions. Preferably, a substantial fraction of all constituent RNA species in the
biological sample are measured, but at least a sufficient fraction is measured to characterize
the action of a drug or other perturbation of interest. The transcriptional state of a
biological sample can be conveniently determined by, e.g., measuring cDNA abundances by
any of several existing gene expression technologies. One particularly preferred
embodiment of the invention employs DNA arrays for measuring mRNA or transcript level
of a large number of genes.

Another aspect of the biological state of a biological sample usefully measured in
the present invention is its translational state. The translational state of a biological sample
includes the identities and abundances of the constituent protein species in the biological
sample under a given set of conditions. Preferably, a substantial fraction of all constituent
protein species in the biological sample is measured, but at least a sufficient fraction is
measured to characterize the action of a drug of interest. As is known to those of skill in the
art, the transcriptional state is often representative of the translational state.

Other aspects of the biological state of a biological sample are also of use in this
invention. For example, the activity state of a biological sample, as that term is used herein,
includes the activities of the constituent protein species (and also optionally catalytically
active nucleic acid species) in the biological sample under a given set of conditions. As is
known to those of skill in the art, the translational state is often representative of the activity
state.

This invention is also adaptable, where relevant, to "mixed" aspects of the biological
state of a biological sample in which measurements of different aspects of the biological
state of a biological sample are combined. For example, in one mixed aspect, the
abundances of certain RNA species and of certain protein species, are combined with
measurements of the activities of certain other protein species. Further, it will be
appreciated from the following that this invention is also adaptable to other aspects of the
biological state of the biological sample that are measurable.

The biological state of a biological sample (e.g., a cell or cell culture) is represented
by a profile of some number of cellular constituents. Such a profile of cellular constituents
can be represented by the vector $S$:

$$S = [S_1, \ldots, S_i, \ldots, S_k] \qquad (1)$$

where $S_i$ is the level of the i'th cellular constituent, for example, the transcript level of
gene i, or alternatively, the abundance or activity level of protein i.

In some embodiments, cellular constituents are measured as continuous variables. 
For example, transcriptional rates are typically measured as number of molecules 
synthesized per unit of time. Transcriptional rate may also be measured as percentage of a 
control rate. However, in some other embodiments, cellular constituents may be measured
as categorical variables. For example, transcriptional rates may be measured as either "on"
or "off", where the value "on" indicates a transcriptional rate above a predetermined
threshold and the value "off" indicates a transcriptional rate below that threshold.

5.1.2. REPRESENTATION OF BIOLOGICAL RESPONSES
The responses of a biological sample to a perturbation, such as the application of a
drug, can be measured by observing the changes in the biological state of the biological
sample. A response profile is a collection of changes of cellular constituents. In the present
invention, the response profile of a biological sample (e.g., a cell or cell culture) to the
perturbation m is defined as the vector $v^{(m)}$:

$$v^{(m)} = [v_1^{(m)}, \ldots, v_i^{(m)}, \ldots, v_k^{(m)}] \qquad (2)$$

where $v_i^{(m)}$ is the amplitude of response of cellular constituent i under the
perturbation m. In some particularly preferred embodiments of this invention, the
biological response to the application of a drug, a drug candidate or any other perturbation
is measured by the induced change in the transcript level of at least 2 genes, preferably more
than 10 genes, more preferably more than 100 genes and most preferably more than 1,000
genes.

In some embodiments of the invention, the response is simply the difference
between biological variables before and after perturbation. In some preferred embodiments,
the response is defined as the ratio of cellular constituents before and after a perturbation is
applied. In other embodiments, the response may be a function of time after the
perturbation, i.e., $v^{(m)} = v^{(m)}(t)$. For example, $v_i^{(m)}(t)$ may be the difference or ratio of cellular
constituents before the perturbation and at time t after the perturbation.

In some preferred embodiments, $v_i^{(m)}$ is set to zero if the response of gene i is below
some threshold amplitude or confidence level determined from knowledge of the
measurement error behavior. In such embodiments, those cellular constituents whose
measured responses are lower than the threshold are given the response value of zero,
whereas those cellular constituents whose measured responses are greater than the threshold
retain their measured response values. This truncation of the response vector is a good
strategy when most of the smaller responses are expected to be greatly dominated by
measurement error. After the truncation, the response vector $v^{(m)}$ also approximates a
'matched detector' (see, e.g., Van Trees, 1968, Detection, Estimation, and Modulation
Theory, Vol. I, Wiley & Sons) for the existence of similar perturbations. It is apparent to
those skilled in the art that the truncation levels can be set based upon the purpose of
detection and the measurement errors. For example, in some embodiments, genes whose
transcript level changes are lower than two fold or more preferably four fold are given the
value of zero.
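
By way of illustration only, the truncation described above may be sketched as follows. The sketch assumes that responses are supplied as expression ratios (after/before) and are converted to log10 ratios so that increases and decreases are treated symmetrically; the function name, the log transform, and the default two-fold threshold are illustrative assumptions rather than requirements of the method.

```python
import math

def truncate_response(ratios, fold_threshold=2.0):
    """Zero out responses whose fold change is below the threshold.

    ratios: expression ratios (after / before), all strictly positive.
    Returns log10 ratios; entries below the threshold become 0.0.
    """
    cutoff = math.log10(fold_threshold)
    response = []
    for ratio in ratios:
        log_ratio = math.log10(ratio)
        # Up- and down-regulation are treated symmetrically (an assumption).
        response.append(log_ratio if abs(log_ratio) >= cutoff else 0.0)
    return response
```

A four-fold truncation level would be obtained by passing `fold_threshold=4.0`.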

In some preferred embodiments, perturbations are applied at several levels of
strength. For example, different amounts of a drug may be applied to a biological sample to
observe its response. In such embodiments, the perturbation responses may be interpolated
by approximating each by a single parameterized "model" function of the perturbation
strength u. An exemplary model function appropriate for approximating transcriptional
state data is the Hill function, which has adjustable parameters a, $u_0$, and n:

$$H(u) = \frac{a\,(u/u_0)^n}{1 + (u/u_0)^n} \qquad (3)$$

The adjustable parameters are selected independently for each cellular constituent of the 
perturbation response. Preferably, the adjustable parameters are selected for each cellular 
constituent so that the sum of the squares of the differences between the model function 
(e.g., the Hill function, Equation 3) and the corresponding experimental data at each
perturbation strength is minimized. This preferable parameter adjustment method is well
known in the art as a least squares fit. Other possible model functions are based on
polynomial fitting, for example by various known classes of polynomials. More detailed
description of model fitting and biological response has been disclosed in Friend and
Stoughton, Methods of Determining Protein Activity Levels Using Gene Expression
Profiles, U.S. Provisional Application Serial No. 60/084,742, filed on May 8, 1998, which
is incorporated herein by reference for all purposes.
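
As a non-limiting sketch, a least squares fit of a Hill-type model to the titration data of one cellular constituent might be carried out as shown below. The sketch assumes the common Hill form a(u/u0)^n / (1 + (u/u0)^n) and replaces a general-purpose optimizer with a simple grid search over u0 and n; for each grid point the amplitude a is obtained in closed form because the model is linear in a. The function names and the grids are hypothetical.

```python
def hill(u, a, u0, n):
    """Hill function a * (u/u0)**n / (1 + (u/u0)**n)."""
    x = (u / u0) ** n
    return a * x / (1.0 + x)

def fit_hill(doses, responses, u0_grid, n_grid):
    """Least squares fit by grid search over (u0, n).

    For fixed u0 and n the best amplitude is sum(y*h) / sum(h*h),
    where h is the unit-amplitude Hill curve at the measured doses.
    Returns (sum of squared errors, a, u0, n) for the best grid point.
    """
    best = None
    for u0 in u0_grid:
        for n in n_grid:
            h = [hill(u, 1.0, u0, n) for u in doses]
            hh = sum(v * v for v in h)
            if hh == 0.0:
                continue  # model is identically zero at these doses
            a = sum(y * v for y, v in zip(responses, h)) / hh
            sse = sum((y - a * v) ** 2 for y, v in zip(responses, h))
            if best is None or sse < best[0]:
                best = (sse, a, u0, n)
    return best
```

The fit is performed independently for each cellular constituent; a finer grid or a continuous optimizer could be substituted where higher precision is needed.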



5.1.3. OVERVIEW OF THE INVENTION 
This invention provides a method for enhanced detection, classification, and pattern
recognition of biological states and biological responses. It is a discovery of the inventors
that biological state and response measurements, i.e., cellular constituents and changes of
cellular constituents, can be classified into co-varying sets. Expressing biological states and
responses in terms of those co-varying sets offers many advantages over representation of
profiles of biological states and responses.

One aspect of the invention provides methods for defining co-varying cellular
constituent sets. Fig. 1 is a schematic view of an exemplary embodiment of this aspect of
the invention. First, a biological sample (or a population of biological samples) is subjected to a
wide variety of perturbations (101). The biological sample may be repeatedly tested under
different perturbations sequentially, or many biological samples may be used and each of the
biological samples can be tested for one perturbation. For a particular type of perturbation,
such as a drug, different doses of the perturbation may be applied.

In some particularly preferred embodiments, different chemical compounds, 
mutations, temperature changes, etc., are used as perturbations to generate a large data set. 
In most embodiments, at least 5, preferably more than 10, more preferably more than 50,
most preferably more than 100 different perturbations are employed.

In the preferred embodiment of the invention, the biological samples used here for
cluster analysis are of the same type and from the same species as the species of interest.
For example, human kidney cells are tested to define cellular constituent sets that are useful
for the analysis of human kidney cells. In some other preferred embodiments, the biological
samples used here for cluster analysis are not of the same type or not from the same species.
For example, yeast cells may be used to define certain yeast cellular constituent sets that are
useful for human tissue analysis.

The biological samples subjected to perturbation are monitored for their cellular
constituents (level, activity, or structure change, etc.) (102). Those biological samples are
occasionally referred to herein as training samples and the data obtained are referred to as
training data. The term "monitoring" as used herein is intended to include continuous
measuring as well as end point measurement. In some embodiments, the cellular
constituents of the biological samples are measured continuously. In other embodiments,
the cellular constituents before and after perturbation are measured and compared. In still
other embodiments, the cellular constituents are measured in a control group of biological
samples under no perturbation, and the cellular constituents of several experimental groups
are measured and compared with those of the control group. It is apparent to those skilled
in the art that other experimental designs are also suitable for the method of this invention,
to detect the change in cellular constituents in response to perturbations.

The responses of cellular constituents to various perturbations are analyzed to
generate co-varying sets (103). The data are first grouped by cluster analysis according to
the method described in Section 5.2, infra, to generate a cluster tree which depicts the
similarity of the responses of cellular constituents to perturbation (104). A cut off value is
set so that the number of sets (branches) is preferably matched with the number of known
pathways involving the cellular constituents studied (105). In some embodiments where the
number of pathways is unknown, cellular constituents are clustered into the maximal
number of truly distinct branches (or sets).

The cellular constituent sets may be refined by utilizing the ever increasing 
knowledge about biological pathways and regulatory pathways obtained from the art (106). 
Conversely, the cluster analysis method of the invention is useful for deciphering complex
biological pathways.

In another aspect of the invention, biological state and biological responses of a
biological sample are represented by combined values for cellular constituent sets. In one
exemplary embodiment as depicted in Fig. 2, the cellular constituents (202) of a biological
sample (201) are grouped into three predefined cellular constituent sets (203), (204) and
(205). The measurements of the cellular constituents (202) within a cellular constituent set
are combined to generate set values (206), (207) and (208). This step of converting from
cellular constituent values to set values is termed 'projection.' This projection operation
expresses the profile on a smaller and biologically more meaningful set of coordinates,
reducing the effects of measurement errors by averaging them over each set, and aiding
biological interpretation of the profile.
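
The projection step may be sketched, purely for illustration, as a per-set average of the measured constituent values. The use of an unweighted mean as the combining rule, together with the function and argument names, is an assumption made for this sketch; other combining rules fall within the description above.

```python
def project_onto_sets(profile, genesets):
    """Combine constituent values into one value per constituent set.

    profile:  dict mapping constituent name -> measured value
    genesets: dict mapping set name -> list of constituent names
    Returns a dict mapping set name -> mean value over the set.
    """
    return {
        name: sum(profile[g] for g in members) / len(members)
        for name, members in genesets.items()
    }
```

Averaging over a set of co-varying constituents reduces independent measurement errors while preserving the shared response amplitude.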

Using set values does not necessarily cause loss of information by combining
individual cellular constituent values. Because the cellular constituents within a set
co-vary, individual cellular constituents provide little more information than the combined set
value. In most embodiments, in this step, the quantitative description of a profile changes
from a list of, for example, 100 numbers to a substantially shorter list, for example 10,
representing the amplitude of each individual response pattern (coordinated change in any
one geneset) needed to closely represent, in a sum, the entire profile.

The conversion of cellular constituent values into set values, however, offers many
benefits by greatly reducing the measurement errors and random variations and thus
enhancing pattern detection.

Another aspect of the invention provides methods for using the simplified
description, or 'projection' of the profile onto cellular constituent sets in drug discovery,
diagnosis, genetic analysis and other applications. Profiles of responses expressed in terms
of cellular constituent sets, particularly genesets in some preferred embodiments, can be
compared with enhanced accuracy. In some embodiments of the invention, a geneset
response profile of a biological sample to an unknown perturbation, such as a drug
candidate, is compared with the geneset profiles generated with a number of known
perturbations. The biological nature, such as its pharmacological activities, of an unknown
perturbation can be determined by examining the similarity of its response profile with
known profiles. In some embodiments, an objective measure of similarity is used. In one
particularly preferred embodiment, the generalized angle between the vectors representing
the projections of the two profiles being compared (the 'normalized dot product') is the
objective measure. In some other embodiments, the amplitude associated with each geneset
for the projected profile can be masked with threshold values to declare the presence or
absence of a change in that geneset. This will be a more sensitive detector of changes in
that geneset than one based on individual cellular constituents from that geneset detected
separately. It is also a more accurate quantitative monitor of the amplitude of change in that
geneset. Thus, the presence of specific biological perturbations can be detected more
sensitively, and similarities between the mechanisms of action of different compounds or
perturbations discovered more efficiently.
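
As an illustrative sketch only, the comparison of a projected profile against a library of known projected profiles using the normalized dot product may be written as follows. The library contents, the names, and the convention of returning 0.0 for an all-zero profile are hypothetical choices made for the sketch.

```python
import math

def normalized_dot(p, q):
    """Cosine of the generalized angle between two projected profiles."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0

def most_similar(unknown, known_profiles):
    """Return (name, similarity) of the known profile closest to `unknown`.

    known_profiles: dict mapping perturbation name -> geneset profile.
    """
    scored = ((name, normalized_dot(unknown, prof))
              for name, prof in known_profiles.items())
    return max(scored, key=lambda pair: pair[1])
```

A similarity near +1 suggests that the unknown perturbation acts on the same genesets, in the same directions, as the matched known perturbation.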

5.2. SPECIFIC EMBODIMENT: DEFINING BASIS GENESETS
In this section, a preferred embodiment of the invention is described in detail.
While the basis genesets are used as an illustrative embodiment of the invention, it is
apparent to those skilled in the art that this invention is not limited to genesets and gene
expression, but is useful for analyzing many types of cellular constituents.

One particular aspect of the invention provides methods for clustering co-regulated 
genes into genesets. This section provides a more detailed discussion of methods for 
clustering co-regulated genes. 

5.2.1. CO-REGULATED GENES AND GENESETS

Certain genes tend to increase or decrease their expression in groups. Genes tend to
increase or decrease their rates of transcription together when they possess similar
regulatory sequence patterns, i.e., transcription factor binding sites. This is the mechanism
for coordinated response to particular signaling inputs (see, e.g., Madhani and Fink, 1998,
The riddle of MAP kinase signaling specificity, Transactions in Genetics 14:151-155;
Arnone and Davidson, 1997, The hardwiring of development: organization and function of
genomic regulatory systems, Development 124:1851-1864). Separate genes which make
different components of a necessary protein or cellular structure will tend to co-vary.
Duplicated genes (see, e.g., Wagner, 1996, Genetic redundancy caused by gene duplications
and its evolution in networks of transcriptional regulators, Biol. Cybern. 74:557-567) will
also tend to co-vary to the extent mutations have not led to functional divergence in the
regulatory regions. Further, because regulatory sequences are modular (see, e.g., Yuh et
al., 1998, Genomic cis-regulatory logic: experimental and computational analysis of a sea
urchin gene, Science 279:1896-1902), the more modules two genes have in common, the
greater the variety of conditions under which they are expected to co-vary their
transcriptional rates. Separation between modules also is an important determinant since
co-activators also are involved. In summary therefore, for any finite set of conditions, it is
expected that genes will not all vary independently, and that there are simplifying subsets of
genes and proteins that will co-vary. These co-varying sets of genes form a complete basis
in the mathematical sense with which to describe all the profile changes within that finite
set of conditions. One aspect of the invention classifies genes into groups of co-varying
genes. The analysis of the responses of these groups, or genesets, allows the increases in
detection sensitivity and classification accuracy.

5.2.2. GENESET CLASSIFICATION BY CLUSTER ANALYSIS 

For many applications of the present invention, it is desirable to find basis genesets
that are co-regulated over a wide variety of conditions. This allows the method of the invention
to work well for a large class of profiles whose expected properties are not well
circumscribed. A preferred embodiment for identifying such basis genesets involves
clustering algorithms (for reviews of clustering algorithms, see, e.g., Fukunaga, 1990,
Statistical Pattern Recognition, 2nd Ed., Academic Press, San Diego; Everitt, 1974, Cluster
Analysis, London: Heinemann Educ. Books; Hartigan, 1975, Clustering Algorithms, New
York: Wiley; Sneath and Sokal, 1973, Numerical Taxonomy, Freeman; Anderberg, 1973,
Cluster Analysis for Applications, Academic Press: New York).

In some embodiments employing cluster analysis, the expression of a large number
of genes is monitored as biological samples are subjected to a wide variety of perturbations
(see Section 5.8, infra, for detailed discussion of perturbations useful for this invention). A
table of data containing the gene expression measurements is used for cluster analysis. In
order to obtain basis genesets that contain genes which co-vary over a wide variety of
conditions, at least 10, preferably more than 50, most preferably more than 100
perturbations or conditions are employed. Cluster analysis operates on a table of data which
has the dimension m x k, wherein m is the total number of conditions or perturbations and k
is the number of genes measured.

A number of clustering algorithms are useful for clustering analysis. Clustering
algorithms use dissimilarities or distances between objects when forming clusters. In some
embodiments, the distance used is Euclidean distance in multidimensional space:

$$I(x,y) = \left\{ \sum_i (X_i - Y_i)^2 \right\}^{1/2} \qquad (4)$$

where $I(x,y)$ is the distance between gene X and gene Y (or between any other cellular
constituents X and Y); $X_i$ and $Y_i$ are gene expression responses under perturbation i. The
Euclidean distance may be squared to place progressively greater weight on objects that are
further apart. Alternatively, the distance measure may be the Manhattan distance, e.g.,
between gene X and Y, which is provided by:

$$I(x,y) = \sum_i |X_i - Y_i| \qquad (5)$$

Again, $X_i$ and $Y_i$ are gene expression responses under perturbation i. Some other definitions
of distances are Chebychev distance, power distance, and percent disagreement. Percent
disagreement, defined as $I(x,y)$ = (number of $X_i \neq Y_i$)/i, is particularly useful for the method
of this invention if the data for the dimensions are categorical in nature. Another useful
distance definition, which is particularly useful in the context of cellular response, is
$I = 1 - r$, where r is the correlation coefficient between the response vectors X, Y, also called
the normalized dot product $X \cdot Y / |X||Y|$. Specifically, the dot product $X \cdot Y$ is defined by the
equation:

$$X \cdot Y = \sum_i X_i \times Y_i \qquad (6)$$

and $|X| = (X \cdot X)^{1/2}$, $|Y| = (Y \cdot Y)^{1/2}$.

Most preferably, the distance measure is appropriate to the biological questions
being asked, e.g., for identifying co-varying and/or co-regulated cellular constituents
including co-varying or co-regulated genes. For example, in a particularly preferred
embodiment, the distance measure is $I = 1 - r$ with the correlation coefficient r which
comprises a weighted dot product of the genes X and Y. Specifically, in this preferred
embodiment, r is preferably defined by the equation

$$r = \frac{\sum_i \dfrac{X_i Y_i}{\sigma_i^{(X)} \sigma_i^{(Y)}}}{\left[ \sum_i \left( \dfrac{X_i}{\sigma_i^{(X)}} \right)^2 \sum_i \left( \dfrac{Y_i}{\sigma_i^{(Y)}} \right)^2 \right]^{1/2}} \qquad (7)$$

where $\sigma_i^{(X)}$ and $\sigma_i^{(Y)}$ are the standard errors associated with the measurement of genes X and
Y, respectively, in experiment i.

The correlation coefficients of the normal and weighted dot products above are
bounded between values of +1, which indicates that the two response vectors are perfectly
correlated and essentially identical, and -1, which indicates that the two response vectors are
"anti-correlated" or "anti-sense" (i.e., are opposites). These correlation coefficients are
particularly preferable in embodiments of the invention where cellular constituent sets or
clusters are sought of constituents which have responses of the same sign.
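
For illustration, the distance measures discussed above may be sketched as follows. The sketch assumes response vectors of equal length supplied as plain lists; the function names are hypothetical, percent disagreement is normalized by the number of dimensions, and the weighted variant divides each response by its standard error before forming the normalized dot product.

```python
import math

def euclidean(x, y):
    """Euclidean distance in multidimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan (city-block) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def percent_disagreement(x, y):
    """Fraction of dimensions in which two categorical vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b) / len(x)

def correlation_distance(x, y, sx=None, sy=None):
    """I = 1 - r, with r the (optionally error-weighted) normalized dot product.

    sx, sy: per-experiment standard errors; omitted means unweighted.
    """
    sx = sx or [1.0] * len(x)
    sy = sy or [1.0] * len(y)
    xs = [a / s for a, s in zip(x, sx)]
    ys = [b / s for b, s in zip(y, sy)]
    dot = sum(a * b for a, b in zip(xs, ys))
    norm = math.sqrt(sum(a * a for a in xs) * sum(b * b for b in ys))
    return 1.0 - dot / norm
```

With the correlation-based distance, perfectly correlated responses give a distance of 0 and perfectly anti-correlated responses give a distance of 2.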

In other embodiments, it is preferable to identify cellular constituent sets or clusters
which are co-regulated or involved in the same biological responses or pathways, but which
comprise similar and anti-correlated responses. For example, Fig. 10 illustrates a cascade in
which a signal activates a transcription factor which up-regulates several genes, identified
as G1, G2, and G3. In the example presented in Fig. 10, the product of G3 is a repressor
element for several different genes, e.g., G4, G5, and G6. Thus, it is preferable to be able
to identify all six genes G1-G6 as part of the same cellular constituent set or cluster. In
such embodiments, it is preferable to use the absolute value of either the normalized or
weighted dot products described above, i.e., |r|, as the correlation coefficient.

In still other embodiments, the relationships between co-regulated and/or co-varying
cellular constituents (such as genes) may be even more complex, such as in instances
wherein multiple biological pathways (e.g., signaling pathways) converge on the same
cellular constituent to produce different outcomes. In such embodiments, it is preferable to
use a correlation coefficient $r = r^{(change)}$ which is capable of identifying co-varying and/or
co-regulated cellular constituents irrespective of the sign. The correlation coefficient specified
by Equation 8 below is particularly useful in such embodiments.

$$r^{(change)} = \frac{\sum_i \left| \dfrac{X_i}{\sigma_i^{(X)}} \right| \left| \dfrac{Y_i}{\sigma_i^{(Y)}} \right|}{\left[ \sum_i \left( \dfrac{X_i}{\sigma_i^{(X)}} \right)^2 \sum_i \left( \dfrac{Y_i}{\sigma_i^{(Y)}} \right)^2 \right]^{1/2}} \qquad (8)$$

Various cluster linkage rules are useful for the methods of the invention. Single
linkage, a nearest neighbor method, determines the distance between the two closest
objects. By contrast, complete linkage methods determine distance by the greatest distance
between any two objects in the different clusters. This method is particularly useful in cases
when genes or other cellular constituents form naturally distinct "clumps." Alternatively,
the unweighted pair-group average defines distance as the average distance between all
pairs of objects in two different clusters. This method is also very useful for clustering
genes or other cellular constituents to form naturally distinct "clumps." Finally, the
weighted pair-group average method may also be used. This method is the same as the
unweighted pair-group average method except that the size of the respective clusters is used
as a weight. This method is particularly useful for embodiments where the cluster size is
suspected to be greatly varied (Sneath and Sokal, 1973, Numerical Taxonomy, San
Francisco: W. H. Freeman & Co.). Other cluster linkage rules, such as the unweighted and
weighted pair-group centroid and Ward's method, are also useful for some embodiments of
the invention. See, e.g., Ward, 1963, J. Am. Stat. Assn. 58:236; Hartigan, 1975, Clustering
Algorithms, New York: Wiley.
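
The single, complete and unweighted pair-group average linkage rules may be sketched as shown below. The sketch is illustrative only: each cluster is assumed to be a list of objects, `dist` is any pairwise distance function supplied by the caller, and the function names are hypothetical.

```python
def single_linkage(c1, c2, dist):
    """Distance between the two closest objects of the two clusters."""
    return min(dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2, dist):
    """Greatest distance between any two objects in the two clusters."""
    return max(dist(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2, dist):
    """Unweighted pair-group average over all cross-cluster pairs."""
    total = sum(dist(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))
```

In an agglomerative procedure, the pair of clusters with the smallest linkage distance is merged at each step, and the choice of rule determines the shape of the resulting tree.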

In one particularly preferred embodiment, the cluster analysis is performed using the
hclust routine (see, e.g., 'hclust' routine from the software package S-Plus, MathSoft, Inc.,
Cambridge, MA). An example of a clustering 'tree' output by the hclust algorithm of
S-Plus is shown in Fig. 6 (see also Example 1, Section 6.1, infra). The data set in this case
involved 18 experiments including different drug treatments and genetic mutations related
to the yeast S. cerevisiae biochemical pathway homologous to immunosuppression in
humans. The set of more than 6000 measured mRNA levels was first reduced to 48 by
selecting only those genes which had a response amplitude of at least a factor of 4 in at least
one of the experiments. This initial downselection greatly reduces the confusing effects of
measurement errors, which dominate the small responses of most genes in most
experiments. Clustering using 'hclust' was then performed on the resulting 18 x 48 table of
data, yielding the clustering tree shown in Fig. 6. When the number and diversity of
experiments in the clustering set is larger, then the fraction of measured cellular constituents
with significant responses (well above the measurement error level) is also larger, and
eventually most or all of the set of cellular constituents are retained in the first
downselection and become represented in the clustering tree. The genesets derived from the tree
then more completely cover the set of cellular constituents.
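
The initial downselection may be sketched as follows, purely by way of example. The sketch assumes that each gene's responses are stored as expression ratios, one per experiment, and that a factor-of-4 change in either direction qualifies; the data layout and the function name are assumptions.

```python
def downselect(table, min_fold=4.0):
    """Keep genes with at least `min_fold` change in at least one experiment.

    table: dict mapping gene -> list of expression ratios (one per experiment).
    A ratio r qualifies if r >= min_fold or r <= 1 / min_fold.
    """
    kept = {}
    for gene, ratios in table.items():
        if any(r >= min_fold or r <= 1.0 / min_fold for r in ratios):
            kept[gene] = ratios
    return kept
```

The retained genes then form the columns of the data table passed to the clustering routine.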

As the diversity of perturbations in the clustering set becomes very large, the
genesets which are clearly distinguishable get smaller and more numerous. However, it is a
discovery of the inventors that even over very large experiment sets, there are small
genesets that retain their coherence. These genesets are termed irreducible genesets. In
some embodiments of the invention, a large number of diverse perturbations are applied to
obtain such irreducible genesets. For example, Geneset No. 1 at the left in Figure 6 is found
also when clustering is performed on a much larger set of perturbation conditions. A data
set of 365 yeast conditions including the 18 previously mentioned was used for cluster
analysis. Perturbation conditions include drug treatments at different concentrations and
measured after different times of treatment, responses to genetic mutations in various genes,
combinations of drug treatment and mutations, and changes in growth conditions such as
temperature, density, and calcium concentration. Most of these conditions had nothing to
do with the immunosuppressant drugs used in the 18-experiment set; however, the geneset
retains its coherence. Genesets No. 2 and No. 3 also retain partial coherence.

Genesets may be defined based on the many smaller branches in the tree, or a small
number of larger branches by cutting across the tree at different levels - see the example
dashed line in Fig. 6. The choice of cut level may be made to match the number of distinct
response pathways expected. If little or no prior information is available about the number
of pathways, then the tree should be divided into as many branches as are truly distinct.
'Truly distinct' may be defined by a minimum distance value between the individual
branches. In Fig. 6, this distance is the vertical coordinate of the horizontal connector
joining two branches. Typical values are in the range 0.2 to 0.4, where 0 is perfect
correlation and 1 is zero correlation, but may be larger for poorer quality data or fewer
experiments in the training set, or smaller in the case of better data and more experiments
in the training set.
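
A minimal sketch of defining genesets by a minimum distance between branches is given below. It is not the hclust routine; it assumes average linkage with the distance 1 - r (r being the normalized dot product), nonzero response vectors, and a hypothetical cut level of 0.3. Merging stops once the closest pair of branches is farther apart than the cut level, which corresponds to cutting the tree at that height.

```python
import math

def corr_dist(x, y):
    """Distance 1 - r, with r the normalized dot product of x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def cluster_with_cut(responses, cut=0.3):
    """Agglomerate by average linkage until branches are farther apart than `cut`.

    responses: dict mapping gene -> response vector across experiments.
    Returns a list of clusters, each a sorted list of gene names.
    """
    clusters = [[g] for g in responses]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(corr_dist(responses[a], responses[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > cut:
            break  # remaining branches are distinct at this cut level
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]
```

Raising the cut level yields fewer, larger genesets; lowering it yields more numerous, smaller genesets.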

Preferably, 'truly distinct' may be defined with an objective test of statistical
significance for each bifurcation in the tree. In one aspect of the invention, the Monte
Carlo randomization of the experiment index for each cellular constituent's responses
across the set of experiments is used to define an objective test.

In some embodiments, the objective test is defined in the following manner:

Let $p_{ki}$ be the response of constituent k in experiment i. Let $\Pi(i)$ be a random
permutation of the experiment index. Then for each of a large (about 100 to 1000) number
of different random permutations, construct $p_{k\Pi(i)}$. For each branching in the original tree,
for each permutation:

(1) perform hierarchical clustering with the same algorithm ('hclust' in this case)
used on the original unpermuted data;

(2) compute the fractional improvement f in the total scatter with respect to cluster
centers in going from one cluster to two clusters

$$f = 1 - \frac{\sum D_k^{(2)}}{\sum D_k^{(1)}} \qquad (9)$$

where $D_k$ is the square of the distance measure for constituent k with respect to the center
(mean) of its assigned cluster. Superscript 1 or 2 indicates whether it is with respect to the
center of the entire branch or with respect to the center of the appropriate cluster out of the
two subclusters. There is considerable freedom in the definition of the distance function D
used in the clustering procedure. In these examples, D = 1 - r, where r is the correlation
coefficient between the responses of one constituent across the experiment set vs. the
responses of the other (or vs. the mean cluster response).
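
By way of a hedged example, the fractional improvement f in total scatter for one candidate bifurcation may be computed as sketched below. The sketch assumes the distance 1 - r to the cluster mean (r being the normalized dot product), squares that distance for each constituent, and takes the two subclusters as given; the function names are hypothetical.

```python
import math

def corr_dist(x, y):
    """Distance 1 - r, with r the normalized dot product of x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def mean_vector(vectors):
    """Component-wise mean of a list of response vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def total_scatter(vectors):
    """Sum of squared distances of each constituent to its cluster mean."""
    center = mean_vector(vectors)
    return sum(corr_dist(v, center) ** 2 for v in vectors)

def fractional_improvement(sub1, sub2):
    """f = 1 - (scatter with two clusters) / (scatter with one cluster)."""
    one_cluster = total_scatter(sub1 + sub2)
    two_clusters = total_scatter(sub1) + total_scatter(sub2)
    return 1.0 - two_clusters / one_cluster
```

Values of f near 1 indicate that splitting the branch removes almost all of the scatter; values near 0 indicate that the split explains little.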

The distribution of fractional improvements obtained from the Monte Carlo
procedure is an estimate of the distribution under the null hypothesis that a particular
branching was not significant. The actual fractional improvement for that branching with
the unpermuted data is then compared to the cumulative probability distribution from the
null hypothesis to assign significance. Standard deviations are derived by fitting a log
normal model for the null hypothesis distribution.

The numbers displayed at the bifurcations in Fig. 6 are the significance, in standard
deviations, of each bifurcation. Numbers greater than about 2, for example, indicate that the
branching is significant at the 95% confidence level.
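
The randomization of the experiment index and the log-normal significance estimate may be sketched as follows. This is an illustrative sketch under stated assumptions: each constituent's responses are shuffled independently, the null fractional improvements are strictly positive, and the log-normal model is fitted simply by taking the mean and sample standard deviation of their logarithms. The function names are hypothetical.

```python
import math
import random
import statistics

def permute_experiments(table, rng):
    """Independently shuffle the experiment index of each constituent.

    table: dict mapping constituent -> list of responses per experiment.
    rng:   a random.Random instance, so that results are reproducible.
    """
    return {k: rng.sample(v, len(v)) for k, v in table.items()}

def significance_sigma(f_observed, f_null):
    """Significance of f_observed in standard deviations of a log-normal null.

    f_null: fractional improvements obtained from the permuted data sets.
    """
    logs = [math.log(f) for f in f_null]
    return (math.log(f_observed) - statistics.mean(logs)) / statistics.stdev(logs)
```

In use, the table would be permuted on the order of 100 to 1000 times, re-clustered each time, and the resulting fractional improvements collected into `f_null` for each branching.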

If, for example, the horizontal cut shown in Fig. 6 is used, and only those branches
with more than two members below the cut are accepted as genesets, three genesets are
obtained in Fig. 6. These three genesets reflect the pathways involving the calcineurin
protein, the PDR gene, and the Gcn4 transcription factor. Therefore, genesets defined by
cluster analysis have underlying biological significance.

In more detail, an objective statistical test is preferably employed to determine the
statistical reliability of the grouping decisions of any clustering method or algorithm.
Preferably, a similar test is used for both hierarchical and non-hierarchical clustering
methods. More preferably, the statistical test employed comprises (a) obtaining a measure
of the compactness of the clusters determined by one of the clustering methods of this
invention, and (b) comparing the obtained measure of compactness to a hypothetical
measure of compactness of cellular constituents regrouped in an increased number of
clusters. For example, in embodiments wherein hierarchical clustering algorithms, such as
hclust, are employed, such a hypothetical measure of compactness preferably comprises the
measure of compactness for clusters selected at the next lowest branch in a clustering tree
(e.g., at LEVEL 1 rather than at LEVEL 2 in Fig. 11). Alternatively, in embodiments
wherein non-hierarchical clustering methods or algorithms are employed, e.g., to generate N
clusters, the hypothetical measure of compactness is preferably the compactness obtained
for N+1 clusters by the same methods.

Cluster compactness may be quantitatively defined, e.g., as the mean squared
distance of elements of the cluster from the "cluster mean," or, more preferably, as the
inverse of the mean squared distance of elements from the cluster mean. The cluster mean
of a particular cluster is generally defined as the mean of the response vectors of all
elements in the cluster. However, in certain embodiments, e.g., wherein the absolute value
of the normalized or weighted dot product is used to evaluate the distance metric (i.e., I = 1
- |r|) of the clustering algorithm, such a definition of cluster mean is problematic. More
generally, the above definition of mean is problematic in embodiments wherein response
vectors may be in opposite directions such that the above defined cluster mean could be
zero. Accordingly, in such embodiments, it is preferable to choose a different definition of
cluster compactness, such as, but not limited to, the mean squared distance between all pairs
of elements in the cluster. Alternatively, the cluster compactness may be defined to
comprise the average distance (or more preferably the inverse of the average distance) from
each element (e.g., cellular constituent) of the cluster to all other elements in that cluster.
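
The difficulty with the cluster mean, and the pairwise alternative, can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the function names
and the two-element example cluster are hypothetical and do not come from the specification,
and the distance used is the 1 - |r| metric mentioned above.

```python
import math
from itertools import combinations


def cluster_mean(vectors):
    """Component-wise mean of the response vectors in one cluster."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def abs_corr_distance(x, y):
    """Distance 1 - |r|, where r is the normalized dot product of x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - abs(dot / norm)


def mean_sq_pairwise_distance(vectors, dist):
    """Mean squared distance between all pairs of elements in the cluster."""
    pairs = list(combinations(vectors, 2))
    return sum(dist(a, b) ** 2 for a, b in pairs) / len(pairs)


# Hypothetical cluster of two anti-correlated responses: their mean vanishes,
# yet under the 1 - |r| metric they are as close as two elements can be.
cluster = [(1.0, 2.0), (-1.0, -2.0)]
center = cluster_mean(cluster)
spread = mean_sq_pairwise_distance(cluster, abs_corr_distance)
```

Here the cluster mean is the zero vector, so distances measured from it carry no
information, while the pairwise measure correctly reports a tight cluster. A compactness
value may then be taken as the inverse of the pairwise measure when it is nonzero.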

Preferably, Step (b) above of comparing cluster compactness to a hypothetical
compactness comprises generating a non-parametric statistical distribution for the changed
compactness in an increased number of clusters. More preferably, such a distribution is
generated using a model which mimics the actual data but has no intrinsic clustered
structures (i.e., a "null hypothesis" model). For example, such distributions may be
generated by (a) randomizing the perturbation experiment index i for each cellular
constituent X, and (b) calculating the change in compactness which occurs for each
distribution, e.g., by increasing the number of clusters from N to N+1 (non-hierarchical
clustering methods), or by increasing the branching level at which clusters are defined
(hierarchical methods).

Such a process is illustrated in Fig. 12 for an exemplary, non-hierarchical
embodiment of the clustering methods wherein the perturbation vectors are two-dimensional
(i.e., there are two perturbation experiments, i = 1, 2) and have lengths |X| = 2.
Their response vectors are therefore displayed in Fig. 12 as points in two-dimensional
space. In the present example, two apparent clusters can be distinguished. These two
clusters are shown in Fig. 12A, and comprise a circular cluster and a dumbbell-shaped
cluster. The cluster centers are indicated by the triangle symbol (△). As is apparent to one
skilled in the art, the distribution of perturbation vectors in Fig. 12 could also be divided
into three clusters, illustrated in Fig. 12B along with their corresponding centers. As will
also be apparent to one skilled in the art, the two new clusters in Fig. 12B are each more
compact than the one dumbbell-shaped cluster in Fig. 12A. However, such an increase in
compactness may not be statistically significant, and so may not be indicative of the actual
or unique cellular constituent sets. In particular, the compactness of a set of N clusters may
be defined in this example as the inverse of the mean squared distance of each element from
its cluster center, i.e., as $1/D^{(N)}_{\mathrm{mean}}$. In general,
$D^{(N+1)}_{\mathrm{mean}} < D^{(N)}_{\mathrm{mean}}$ regardless of whether
there are additional "real" cellular constituent sets. Accordingly, the statistical methods of
this invention may be used to evaluate the statistical significance of the increased
compactness which occurs, e.g., in the present example, when the number of clusters is
increased from N = 2 to N = 3.

In an exemplary embodiment, the increased compactness is given by the parameter
E, which is defined by the formula

$$E = D^{(N)}_{\mathrm{mean}} - D^{(N+1)}_{\mathrm{mean}} \qquad (10)$$

However, other definitions are apparent to those skilled in the art which may also be used in
the statistical methods of this invention. In general, the exact definition of E is not crucial
provided it is monotonically related to the increase in cluster compactness.

The statistical methods of this invention provide methods to analyze the significance
of E. Specifically, these methods provide an empirical distribution approach for the
analysis of E by comparing the actual increase in compactness, $E_0$ for actual experimental
data, to an empirical distribution of E values determined from randomly permuted data
(e.g., by Equation 10 above). In the two-dimensional example illustrated in Fig. 12, such a
permutation comprises, first, randomly swapping the perturbation indices i = 1, 2 in each
response vector with equal probability. More specifically, the coordinates (i.e., the indices)
of the vectors in each cluster being subdivided are "reflected" about the cluster center, e.g.,
by first translating the coordinate axes to the cluster center as shown in Fig. 12C. The
results of such an operation are shown, for the two-dimensional example, in Fig. 12D.
Second, the randomly permuted data are re-evaluated by the cluster algorithms of the
invention, most preferably by the same cluster algorithm used to determine the original
cluster(s), so that new clusters are determined for the permuted data, and a value of E is
evaluated for these new clusters (i.e., for splitting one or more of the new clusters). Steps
one and two above are repeated for some number of Monte Carlo trials to generate a
distribution of E values. Preferably, the number of Monte Carlo trials is from about 50 to
about 1000, and more preferably from about 50 to about 100. Finally, the actual increase in
compactness, i.e., $E_0$, is compared to this empirical distribution of E values. For example, if
M Monte Carlo simulations are performed, of which x have E values greater than $E_0$, then the
confidence level in the number of clusters may be evaluated from 1 - x/M. In particular, if
M = 100 and x = 4, then the confidence level that increasing the number of clusters is
significant is 1 - 4/100 = 96%.
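
The Monte Carlo procedure described above can be sketched as follows. This is a minimal
illustration, not the implementation used to produce the figures: it assumes a small
k-means-style clusterer with a deterministic farthest-point start (a stand-in for whichever
non-hierarchical algorithm is actually used), it takes E as the drop in mean squared
distance when going from N to N+1 clusters (any monotonically related definition would
serve), and all function names and the toy data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two response vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_centers(points, k, iters=20):
    """Tiny k-means with a deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def mean_sq_distance(points, k):
    """Mean squared distance of each element from its nearest cluster center."""
    centers = kmeans_centers(points, k)
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)


def compactness_gain(points, n):
    """E: improvement obtained by regrouping n clusters into n + 1 clusters."""
    return mean_sq_distance(points, n) - mean_sq_distance(points, n + 1)


def split_confidence(points, n, trials=100, seed=0):
    """Return (E0, 1 - x/M) using index-permuted data as the null model."""
    rng = random.Random(seed)
    e0 = compactness_gain(points, n)
    exceed = 0
    for _ in range(trials):
        # Randomize the experiment index independently for each constituent.
        permuted = [rng.sample(p, len(p)) for p in points]
        if compactness_gain(permuted, n) > e0:
            exceed += 1
    return e0, 1.0 - exceed / trials


# Hypothetical two-experiment data with three visually distinct groups.
data = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
e0, confidence = split_confidence(data, n=2)
```

The returned confidence is the quantity 1 - x/M from the text; a value near 1 suggests
that the additional cluster reflects structure absent from the permuted data.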

The above methods are equally applicable to embodiments comprising hierarchical
clusters and/or a plurality of elements (e.g., more than two cellular constituents). For
example, consider the cluster tree illustrated in Fig. 11. This clustering tree was obtained
using the hclust algorithm for 34 perturbation response profiles comprising 185 cellular
constituents which had significant responses. Using the clusters defined by the branches at
LEVEL 2 of this tree, 100 Monte Carlo simulations were performed randomizing the 34
experimental indices, and empirical distributions for the improvements in compactness E
were generated for each branching in the tree. The actual improvement in compactness $E_0$
at each branch was compared with its corresponding distribution. These comparisons are
shown by the numbers at each branch in Fig. 11. Specifically, these numbers indicate the
number of standard deviations in the distribution by which $E_0$ exceeds the average value of
E. The indicated significances correspond well with the independently determined biological
significance of the branches. For example, the main branch indicated in Fig. 7 by the
number five (bottom label) comprises genes regulated via the calcineurin protein, whereas
the branch labeled number 7 primarily comprises genes regulated by the Gcn4 transcription
factor.

Further, although the Monte Carlo methods of the present invention are described
above, for exemplary purposes, in terms of the permutation of a perturbation index i, it is
readily appreciated by those skilled in the art that such methods may also be used by
permuting any index of biological response data which is independent of the cellular
constituent index. For example, in some embodiments the response profile data for cellular
constituent X may be a function of time, e.g., X(t), with a time index t in addition to or in
place of a perturbation index. In such embodiments, the Monte Carlo methods of this
invention may also be used by permuting the time index t.

Another aspect of the cluster analysis method of this invention provides the
definition of basis vectors for use in profile projection described in the following sections.
A set of basis vectors V has k × n dimensions, where k is the number of genes and n
is the number of genesets.

$$V = \begin{bmatrix} V^{(1)}_1 & \cdots & V^{(n)}_1 \\ \vdots & & \vdots \\ V^{(1)}_k & \cdots & V^{(n)}_k \end{bmatrix} \qquad (11)$$

$V^{(n)}_k$ is the amplitude contribution of gene index k in basis vector n. In some embodiments,
$V^{(n)}_k = 1$ if gene k is a member of geneset n, and $V^{(n)}_k = 0$ if gene k is not a member of
geneset n. In some embodiments, $V^{(n)}_k$ is proportional to the response of gene k in geneset n
over the training data set used to define the genesets.

In some preferred embodiments, the elements $V^{(n)}_k$ are normalized so that each basis
vector $V^{(n)}$ has unit length by dividing by the square root of the number of genes in geneset
n. This produces basis vectors which are not only orthogonal (the genesets derived from
cutting the clustering tree are disjoint), but also orthonormal (unit length). With this choice
of normalization, random measurement errors in profiles project onto the $V^{(n)}$ in such a way
that the amplitudes tend to be comparable for each n. Normalization prevents large
genesets from dominating the results of similarity calculations.
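
A short Python sketch of this construction is given below. It assumes the simplest
membership form of the basis (unit weight for members, zero otherwise) followed by the
square-root normalization; the function names and the example genesets are hypothetical.

```python
import math


def build_basis(genesets, num_genes):
    """Return V as num_genes rows (genes) by len(genesets) columns (genesets).

    Each column is a membership vector divided by the square root of the
    number of genes in that geneset, so that it has unit length.
    """
    V = [[0.0] * len(genesets) for _ in range(num_genes)]
    for n, members in enumerate(genesets):
        scale = 1.0 / math.sqrt(len(members))
        for k in members:
            V[k][n] = scale
    return V


def column_dot(V, a, b):
    """Dot product of basis vectors (columns) a and b."""
    return sum(row[a] * row[b] for row in V)


# Hypothetical example: 7 measured genes, two disjoint genesets,
# and one gene (index 6) that belongs to no geneset.
genesets = [[0, 1, 2, 3], [4, 5]]
V = build_basis(genesets, num_genes=7)
```

Because the genesets are disjoint, distinct columns have zero dot product, and the
normalization gives each column unit length regardless of geneset size.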



5.2.3. GENESET CLASSIFICATION BASED UPON
MECHANISMS OF REGULATION
Genesets can also be defined based upon the mechanism of the regulation of genes.
Genes whose regulatory regions have the same transcription factor binding sites are more
likely to be co-regulated. In some preferred embodiments, the regulatory regions of the
genes of interest are compared using multiple alignment analysis to decipher possible shared
transcription factor binding sites (Stormo and Hartzell, 1989, Identifying protein binding
sites from unaligned DNA fragments, Proc Natl Acad Sci 86:1183-1187; Hertz and Stormo,
1995, Identification of consensus patterns in unaligned DNA and protein sequences: a
large-deviation statistical basis for penalizing gaps, Proc of 3rd Intl Conf on Bioinformatics
and Genome Research, Lim and Cantor, eds., World Scientific Publishing Co., Ltd. Singapore,
pp. 201-216). For example, as Example 3, infra, shows, a common promoter sequence
responsive to Gcn4 in 20 genes may be responsible for those 20 genes being co-regulated
over a wide variety of perturbations.

The co-regulation of genes is not limited to those with binding sites for the same
transcription factor. Co-regulated (co-varying) genes may be in an upstream/downstream
relationship where the products of upstream genes regulate the activity of downstream
genes. It is well known to those of skill in the art that there are numerous varieties of
gene regulation networks. One of skill in the art also understands that the methods of this
invention are not limited to any particular kind of gene regulation mechanism. If it can be
derived from the mechanism of regulation that two genes are co-regulated in terms of their
activity change in response to perturbation, the two genes may be clustered into a geneset.

Because of the lack of complete understanding of the regulation of genes of interest, it is
often preferred to combine cluster analysis with regulatory mechanism knowledge to derive
better defined genesets. For example, in some embodiments statistically significant
genesets identified in cluster analysis are compared to biologically significant genesets, e.g.,
that are identified in regulatory mechanism studies. In some preferred embodiments,
K-means clustering may be used to cluster genesets when the regulation of genes of interest is
partially known. K-means clustering is particularly useful in cases where the number of
genesets is predetermined by the understanding of the regulatory mechanism. In general,
K-means clustering is constrained to produce exactly the number of clusters desired. Therefore,
if promoter sequence comparison indicates the measured genes should fall into three
genesets, K-means clustering may be used to generate exactly three genesets with the greatest
possible distinction between clusters.

5.2.4. REFINEMENT OF GENESETS AND GENESET DEFINITION DATABASE
Genesets found as above may be refined with any of several sources of corroborating
information including searches for common regulatory sequence patterns, literature
evidence for co-regulation, sequence homology, known shared function, etc.

Databases are particularly useful for the refinement of genesets. In some
embodiments, a database containing raw data for cluster analysis of genesets is used for
continuously updating geneset definitions. FIG. 3 shows one embodiment of a dynamic
geneset database. Data from perturbation experiments (301) are input into data tables (302)
in the perturbation database management system (308). Geneset definitions, in the form of
basis vectors, are continuously generated based upon the updated data in the perturbation
database using cluster analysis (303) and biological pathway definitions (305, 306). The
resulting geneset definition datatable (304) contains updated geneset definitions.

The geneset definitions are used for refining (307) the biological pathway datatables.
The geneset definition tables are accessible by user-submitted projection requests. A user
(313) can access the database management system by submitting expression profiles (311).
The database management system projects (310) the expression profile into a projected
expression profile (see section 5.3, infra, for a discussion of the projection process). The
user-submitted expression profile is optionally added to the perturbation data tables (302).

This dynamic database is constantly productive in the sense that it provides useful
geneset definitions with the first, and limited, set of perturbation data. The dynamically
updated database continuously refines its geneset definitions to provide more useful geneset
definitions as more perturbation data become available.

In some embodiments of the dynamic geneset definition database, the perturbation
data and geneset definition data are stored in a series of relational tables in digital computer
storage media. Preferably, the database is implemented in distributed system environments
with client/server implementation, allowing multiuser and remote access. Access control
and usage accounting are implemented in some embodiments of the database system.
Relational database management systems and client/server environments are well
documented in the art (Nath, 1995, The Guide to SQL Server, 2nd ed., Addison-Wesley
Publishing Co.).

5.3. REPRESENTATION OF GENE EXPRESSION PROFILES
BASED UPON BASIS GENESETS
One aspect of the invention provides methods for converting the expression value of
genes into the expression value for genesets. This process is referred to as projection. In
some embodiments, the projection is as follows:

$$P = [P_1, \ldots, P_i, \ldots, P_n] = p \cdot V \qquad (12)$$

wherein p is the expression profile, P is the projected profile, $P_i$ is the expression value for
geneset i, and V is a predefined set of basis vectors. The basis vectors have been previously
defined in Equation 11 (Section 5.2.2, supra) as:

$$V = \begin{bmatrix} V^{(1)}_1 & \cdots & V^{(n)}_1 \\ \vdots & & \vdots \\ V^{(1)}_k & \cdots & V^{(n)}_k \end{bmatrix} \qquad (13)$$

wherein $V^{(n)}_k$ is the amplitude of cellular constituent index k of basis vector n.

In one preferred embodiment, the value of geneset expression is simply the average
of the expression value of the genes within the geneset. In some other embodiments, the
average is weighted so that highly expressed genes do not dominate the geneset value. The
collection of the expression values of the genesets is the projected profile.
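
The projection can be illustrated with a few lines of Python. The sketch below assumes the
simple-average embodiment, in which each basis vector weights the members of its geneset
by one over the geneset size; the function names, genesets, and profile values are
hypothetical.

```python
def averaging_basis(genesets, num_genes):
    """Basis whose projection yields the mean expression of each geneset."""
    V = [[0.0] * len(genesets) for _ in range(num_genes)]
    for n, members in enumerate(genesets):
        for k in members:
            V[k][n] = 1.0 / len(members)
    return V


def project(profile, V):
    """Projected profile P = p . V, one value per geneset (column of V)."""
    num_genesets = len(V[0])
    return [
        sum(p_k * row[n] for p_k, row in zip(profile, V))
        for n in range(num_genesets)
    ]


# Hypothetical profile of 6 genes grouped into two genesets.
genesets = [[0, 1], [2, 3, 4, 5]]
profile = [1.0, 3.0, 2.0, 2.0, -4.0, 8.0]
projected = project(profile, averaging_basis(genesets, num_genes=6))
```

A weighted-average embodiment would differ only in the entries written into V; the
`project` function is unchanged.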

5.4. APPLICATION OF PROJECTED PROFILES 
The projected profiles, i.e., biological state or biological responses expressed in
terms of genesets, offer many advantages. This section discusses another aspect of this
invention which provides methods of analysis utilizing projected profiles.

5.4.1. ADVANTAGE OF THE PROJECTED PROFILE
One advantage of using projected profiles is that projected profiles are less
vulnerable to measurement errors. Assuming independent measurement errors in the data
for each cellular constituent, the fractional standard error in the projected profile element is
approximately $M_n^{-1/2}$ times the average fractional standard error for the individual cellular
constituents, where $M_n$ is the number of cellular constituents in the n'th geneset. Thus if the
average up- or down-regulation of the cellular constituents is significant at x standard
deviations, then the projected profile element will be significant at $M_n^{1/2} x$ standard
deviations. This is a standard result for signal-to-noise ratios of mean values; averaging
makes a tremendous difference in the probabilities of detection vs. false alarm (see, e.g., Van
Trees, 1968, Detection, Estimation, and Modulation Theory Vol I, Wiley & Sons).
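
The square-root scaling can be checked with a short derivation. The sketch below assumes,
beyond the independence stated in the text, that every constituent in the geneset has the
same error standard deviation and that the projection is an unweighted average; the symbols
$x_k$, $\epsilon_k$, $\sigma$, and $\bar{x}$ are introduced here only for illustration.

```latex
\begin{align*}
P_n &= \frac{1}{M_n}\sum_{k=1}^{M_n}\left(x_k + \epsilon_k\right),
  \qquad \operatorname{Var}(\epsilon_k) = \sigma^2,\ \epsilon_k \text{ independent} \\
\operatorname{Var}(P_n) &= \frac{1}{M_n^{2}}\sum_{k=1}^{M_n}\sigma^{2}
  = \frac{\sigma^{2}}{M_n} \\
\mathrm{SE}(P_n) &= M_n^{-1/2}\,\sigma \\
\frac{\bar{x}}{\mathrm{SE}(P_n)} &= M_n^{1/2}\,\frac{\bar{x}}{\sigma},
  \qquad \bar{x} = \frac{1}{M_n}\sum_{k=1}^{M_n} x_k
\end{align*}
```

For instance, under these assumptions a geneset of 16 constituents whose average
regulation is significant at 2 standard deviations individually yields a projected element
significant at 8 standard deviations.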

Another advantage of the projected profiles is the reduced dimension of the data set.
For example, a 48 gene data set is represented by three genesets (Example 2) and a 194 gene
data set is represented by 9 genesets (Example 3). This reduction of data dimension greatly
facilitates the analysis of profiles.

Yet another advantage of the projected profiles is that projected profiles tend to
capture the underlying biology. For example, FIG. 6 shows a clustering tree of 48 genes.
Three genesets which correspond to three pathways involving the calcineurin protein, the
PDR gene, and the Gcn4 transcription factor, respectively, are identified (Example 1, infra).

5.4.2. PROFILE COMPARISON AND CLASSIFICATION
Once the basis genesets are chosen, projected profiles $P_i$ may be obtained for any set
of profiles indexed by i. Similarities between the $P_i$ may be more clearly seen than between
the original profiles $p_i$ for two reasons. First, measurement errors in extraneous genes have
been excluded or averaged out. Second, the basis genesets tend to capture the biology of the
profiles $p_i$ and so are matched detectors for their individual response components.
Classification and clustering of the profiles both are based on an objective similarity metric,
call it S, where one useful definition is

$$S_{ij} = S(P_i, P_j) = \frac{P_i \cdot P_j}{|P_i|\,|P_j|} \qquad (14)$$

This definition is the generalized angle cosine between the vectors $P_i$ and $P_j$. It is the
projected version of the conventional correlation coefficient between $p_i$ and $p_j$. Profile $p_i$ is
deemed most similar to that other profile $p_j$ for which $S_{ij}$ is maximum. New profiles may be
classified according to their similarity to profiles of known biological significance, such as
the response patterns for known drugs or perturbations in specific biological pathways. Sets
of new profiles may be clustered using the distance metric

$$D_{ij} = 1 - S_{ij} \qquad (15)$$

where this clustering is analogous to clustering in the original larger space of the entire set of
response measurements, but has the advantages just mentioned of reduced measurement
error effects and enhanced capture of the relevant biology.
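
The similarity and distance metrics, and their use for classifying a new profile against
profiles of known significance, can be sketched as follows. The function names and the
reference library (`drug_A`, `drug_B`) are hypothetical and serve only to illustrate the
generalized angle cosine and the distance derived from it.

```python
import math


def similarity(P_i, P_j):
    """Generalized angle cosine S_ij between two projected profiles."""
    dot = sum(a * b for a, b in zip(P_i, P_j))
    norm_i = math.sqrt(sum(a * a for a in P_i))
    norm_j = math.sqrt(sum(b * b for b in P_j))
    return dot / (norm_i * norm_j)


def distance(P_i, P_j):
    """Distance D_ij = 1 - S_ij, suitable as input to a clustering algorithm."""
    return 1.0 - similarity(P_i, P_j)


def most_similar(query, references):
    """Name of the reference projected profile with maximum similarity to query."""
    return max(references, key=lambda name: similarity(query, references[name]))


# Hypothetical projected profiles over three genesets.
references = {
    "drug_A": [1.0, 0.0, 0.0],
    "drug_B": [0.0, 1.0, 1.0],
}
query = [0.1, 2.0, 1.5]
best = most_similar(query, references)
```

Because the metric depends only on the direction of the projected profiles, a weak and a
strong response acting through the same genesets score as similar.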
The statistical significance of any observed similarity $S_{ij}$ may be assessed using an
empirical probability distribution generated under the null hypothesis of no correlation. This
distribution is generated by performing the projection, Equations (13) and (14) above, for
many different random permutations of the constituent index in the original profile p.
That is, the ordered set $p_k$ is replaced by $p_{\Pi(k)}$, where $\Pi(k)$ is a permutation, for ~100
to 1000 different random permutations. The probability of the similarity $S_{ij}$ arising by
chance is then the fraction of these permutations for which the similarity $S_{ij}$ (permuted)
exceeds the similarity observed using the original unpermuted data.
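
A minimal Python sketch of this permutation test follows. It assumes a unit-length
membership basis and permutes the constituent index of only one of the two profiles, which
is one reasonable reading of the procedure; the function names and example data are
hypothetical.

```python
import math
import random


def project(profile, V):
    """Projected profile P = p . V, one value per geneset (column of V)."""
    return [
        sum(p_k * row[n] for p_k, row in zip(profile, V))
        for n in range(len(V[0]))
    ]


def similarity(a, b):
    """Generalized angle cosine; defined as 0.0 if either vector is zero."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def chance_probability(p_i, p_j, V, n_perm=1000, seed=0):
    """Return (observed S_ij, fraction of permutations exceeding it)."""
    rng = random.Random(seed)
    P_j = project(p_j, V)
    observed = similarity(project(p_i, V), P_j)
    exceed = 0
    for _ in range(n_perm):
        permuted = rng.sample(p_i, len(p_i))  # random permutation of the index k
        if similarity(project(permuted, V), P_j) > observed:
            exceed += 1
    return observed, exceed / n_perm


# Hypothetical data: 6 genes in two genesets of 3, unit-length basis vectors.
w = 1.0 / math.sqrt(3.0)
V = [[w, 0.0], [w, 0.0], [w, 0.0], [0.0, w], [0.0, w], [0.0, w]]
p_i = [2.0, 2.1, 1.9, -1.0, -1.1, -0.9]
p_j = [1.0, 1.2, 0.8, -0.2, -0.6, -0.4]
observed, p_chance = chance_probability(p_i, p_j, V)
```

A small returned fraction indicates that the observed similarity is unlikely to arise
from an arbitrary assignment of constituents to genesets.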

5.4.3. ILLUSTRATIVE DRUG DISCOVERY APPLICATIONS
One aspect of the invention provides methods for drug discovery. In one
embodiment, genesets are defined using cluster analysis. The genes within a geneset are
indicated as potentially co-regulated under the conditions of interest. Co-regulated genes are
further explored as potentially being involved in a regulatory pathway. Identification of
genes involved in a regulatory pathway provides useful information for designing and
screening new drugs.

Some embodiments of the invention employ geneset definition and projection to
identify drug action pathways. In one embodiment, the expression changes of a large
number of genes in response to the application of a drug are measured. The expression
change profile is projected into a geneset expression change profile. In some cases, each of
the genesets represents one particular pathway with a defined biological purpose. By
examining the change of genesets, the action pathway can be deciphered. In some other
cases, the expression change profile is compared with a database of projected profiles
obtained by perturbing many different pathways. If the projected profile is similar to a
projected profile derived from a known perturbation, the action pathway of the drug is
indicated as similar to the known perturbation. Identification of drug action pathways is
useful for drug discovery. See, Stoughton and Friend, Methods for Identifying Pathways of
Drug Action, U.S. Patent Application No. 09/074,983, previously incorporated by reference.

In some embodiments of the invention, drug candidates are screened for their
therapeutic activity (See, Friend and Hartwell, Drug Screening Method, U.S. Provisional
Application No. 60/056,109, filed on August 20, 1998, previously incorporated by reference
for all purposes, for a discussion of drug screening methods). In one embodiment, the desired
drug activity is to affect one particular genetic regulatory pathway. In this embodiment,
drug candidates are screened for their ability to affect the geneset corresponding to the
regulatory pathway. In another embodiment, a new drug is desired to replace an existing
drug. In this embodiment, the projected profiles of drug candidates are compared with that
of the existing drug to determine which drug candidate has activities similar to the existing
drug.

In some embodiments, the method of the invention is used to decipher pathway
arborization and kinetics. When a receptor is triggered (or blocked) by a ligand, the
excitation of the downstream pathways can be different depending on the exact temporal
profile and molecular domains of the ligand interaction with the receptor. Simple examples
of the differing effects of different ligands are the phenotypical differences that arise
between responses to agonists, partial agonists, negative antagonists, and antagonists, and
that are expected to occur in response to covalent vs. noncovalent binding and activation of
different molecular domains on the receptor. See, Ross, Pharmacodynamics: Mechanisms
of Drug Action and the Relationship between Drug Concentration and Effect, in The
Pharmacological Basis of Therapeutics (Gilman et al. ed.), McGraw Hill, New York, 1996.
FIG. 4A illustrates two different possible responses of a pathway cascade.

In some embodiments of the invention, ligands for G protein-coupled receptors
(GPCRs) or other receptors may be investigated using the projection method of the invention
to simplify the observed temporal responses to receptor interactions over the responding
genes. In some particularly preferred embodiments, the genesets and temporal profiles
involved are discovered. The profile of temporal responses of a large number of genes is
projected onto the predefined genesets to obtain a projected profile of temporal responses.
The projection process simplifies the observed responses so that different temporal
responses may be detected and discriminated more accurately.

Figure 4B gives an example of clustering of genes by their temporal response
profiles across several time points. The experiment here was activation of the yeast mating
pathway (same strains, methods, etc. as described earlier) with the yeast α mating
pheromone. Expression responses for all yeast genes ratioed to control (mock treatment)
baseline were measured immediately after treatment, and at 15, 30,
45, 60, 90, and 120 minutes after treatment. This time series of experiments provided the
experiment set for clustering analysis. Each line represents one experiment. A line with an
asterisk represents an experiment that was given low weight in the clustering operation. Three
of the main cluster groups are illustrated in FIG. 4B, showing systematically distinct
temporal behavior. The first group (early) is responding to the STE12 transcription factor,
the second group (adaptive) contains members of the main signaling pathway such as STE2
and STE12 itself that fatigue (show decreasing response) with continued treatment, and the
third group (cell cycle) is associated with the cell cycle perturbations inflicted by the mating
response.

It is possible to define augmented basis vectors whose indices cover constituents and
time points. Projection onto these basis vectors picks out the amplitudes of response in
specific gene groups and of specific temporal profiles. Thus, for example, we could
efficiently detect responses such as those shown in the third group in FIG. 4B by projecting
a time series of expression profiles onto an augmented basis vector whose elements were
nonzero only for the genes included in the third group, and whose nonzero amplitudes varied
over the time index according to the average of the temporal response in the third group.


5.4.4. ILLUSTRATIVE DIAGNOSTIC APPLICATIONS 
One aspect of the invention provides methods for diagnosing diseases of humans,
animals and plants. Those methods are also useful for monitoring the progression of
diseases and the effectiveness of treatments.

In one embodiment of the invention, a patient cell sample, such as a biopsy from a
patient's diseased tissue, is assayed for the expression of a large number of genes. The gene
expression profile is projected into a profile of geneset expression values according to a
definition of genesets. The projected profile is then compared with a reference database
containing reference projected profiles. If the projected profile of the patient matches best
with a cancer profile in the database, the patient's diseased tissue is diagnosed as being
cancerous. Similarly, when the best match is to a profile of another disease or disorder, a
diagnosis of such other disease or disorder is made.

In another embodiment, a tissue sample is obtained from a patient's tumor. The
tissue sample is assayed for the expression of a large number of genes of interest. The gene
expression profile is projected into a profile of geneset expression values according to a
definition of genesets. The projected profile is compared with projected profiles previously
obtained from the same tumor to identify the change of expression in genesets. A reference
library is used to determine whether the geneset changes indicate tumor progression. A
similar method is used to stage other diseases and disorders. Changes of geneset expression
values in a profile obtained from a patient under treatment can be used to monitor the
effectiveness of the treatment, for example, by comparing the projected profile prior to
treatment with that after treatment.

5.4.5. RESPONSE PROFILE CLASSIFICATION BY CLUSTER ANALYSIS
The methods of the present invention are not simply limited to grouping cellular
constituents, such as genes, according to their degrees of co-variation (e.g., by
co-regulation). In particular, the cluster analysis and other statistical classification methods
described above to analyze the co-variation of cellular constituents may also be used to
analyze biological response profiles and to group or cluster such profiles according to the
similarity of their biological responses. Thus, for example, whereas Section 5.2.2 above
describes methods for analyzing cellular constituent "vectors" $X = \{X_i\}$ where i is the
response profile index, the methods and equations described in Section 5.2.2 may also be
used to analyze response profile vectors $v^{(m)} = \{v^{(m)}_i\}$ where m is the response profile index,
and i is the cellular constituent index.

Such analyses may be performed, e.g., using the exact same clustering algorithms,
including 'hclust,' as described in Section 5.2.2 above, and using the exact same distance
metrics. For example, Section 5.2.2 above describes using the distance metric I = 1 - r,
where r is the normalized dot product $X \cdot Y / (|X|\,|Y|)$ for the comparison of cellular constituent
vectors X and Y. As is readily apparent to those skilled in the art, the same distance metric
may also be used to evaluate response profile vectors $v^{(m)}$ and $v^{(n)}$, by evaluating
$r = v^{(m)} \cdot v^{(n)} / (|v^{(m)}|\,|v^{(n)}|)$. Similar applications of the other aspects of the clustering methods
described above in Section 5.2.2, including the other distance metrics and the significance
tests, are also apparent to those skilled in the art and may be used in the present invention.

The analytical methods of this invention thus include methods of "two-dimensional"
cluster analysis. Such two-dimensional cluster analysis methods simply comprise (1)
clustering cellular constituents into sets that are co-varying in biological profiles, and (2)
clustering biological profiles into sets that affect similar cellular constituents (preferably in
similar ways). The two clustering steps may be performed in any order and according to the
methods described above.

Such two-dimensional clustering techniques are useful, as noted above, for
identifying sets of genes and perturbations of particular interest. For example, the
two-dimensional clustering techniques of this invention may be used to identify sets of cellular
constituents (e.g., changes in levels of expression or abundance) and/or experiments that are
associated with a particular biological effect of interest, such as a drug effect or a particular
disease or disease state. The two-dimensional clustering techniques of this invention may
also be used, e.g., to identify sets of cellular constituents and/or experiments that are
associated with a particular biological pathway of interest.

Still further, the above described two-dimensional clustering techniques can be used
to identify perturbations that cause changes in the levels of expression or abundance of
particular cellular constituents of interest or in particular co-varying sets of cellular
constituents (e.g., particular genesets) of interest. For example, in one preferred
embodiment of the invention, such sets of cellular constituents and/or perturbations are used
to determine consensus profiles for a particular biological response of interest. In other
embodiments, identification of such sets of cellular constituents and/or experiments provides
more precise indications of groupings of cellular constituents, such as identification of genes
involved in a particular biological pathway or response of interest.

Accordingly, another preferred embodiment of the present invention provides
methods for identifying cellular constituents, particularly genes (e.g., new genes) or
genesets, whose change (e.g., in levels of expression or abundance) is associated with and/or
involved in a particular biological effect of interest, e.g., a particular biological pathway, the
effect of one or more drugs, a particular disease or disease state or, alternatively, a particular
treatment or therapy (e.g., a particular drug treatment or drug therapy). Such cellular
constituents are identified according to the cluster-analysis methods described above. Such
cellular constituents (e.g., genes) may be previously unknown cellular constituents, or
known cellular constituents that were not previously known to be associated with the
biological effect of interest.

Considering, for example, the particular embodiment of identifying cellular
constituents associated with a disease or disease state, using the two-dimensional clustering
methods described hereinabove, biological profiles that cluster with perturbations associated
with a particular disease or disease state can be identified and examined to identify cellular
constituents and/or cellular constituent sets (e.g., genesets) that consistently change (e.g., in
levels of expression or abundance) within such profiles. Such cellular constituents are useful
as markers (e.g., genetic markers in the case of genes and genesets) for the particular disease
or disease state. In particular, changes in such markers (e.g., in their level of expression or
abundance) observed in a biological sample obtained, e.g., from a patient, can therefore be
used to diagnose the particular disease or disease state in that patient. Those cellular
constituents that are particularly useful as markers (e.g., of a disease or disease state), and
are therefore preferred in the present invention, are those cellular constituents that change
(e.g., in their level of expression or abundance) in perturbations associated with a particular
biological effect (e.g., a particular disease or disease state) of interest but do not change in
other perturbations, i.e., in perturbations that are not associated with the particular
biological effect of interest.
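
By way of illustration only, the marker criterion just described (constituents that change in
perturbations associated with the effect of interest but not in other perturbations) may be
sketched as follows. The function name select_markers, the dictionary layout of the response
data, and the fixed amplitude threshold are assumptions made for this sketch and are not
prescribed by the methods above.

```python
def select_markers(responses, associated, threshold=1.0):
    """Return constituents that change in every associated perturbation
    and in no other perturbation.

    responses  : dict mapping perturbation name -> {constituent: response},
                 e.g. log expression ratios (hypothetical layout)
    associated : set of perturbation names linked to the effect of interest
    threshold  : absolute response above which a constituent "changes"
                 (an assumed, user-selected cutoff)
    """
    constituents = set()
    for profile in responses.values():
        constituents.update(profile)

    markers = []
    for c in sorted(constituents):
        changes_when_associated = all(
            abs(responses[p].get(c, 0.0)) >= threshold for p in associated
        )
        quiet_otherwise = all(
            abs(profile.get(c, 0.0)) < threshold
            for p, profile in responses.items()
            if p not in associated
        )
        if changes_when_associated and quiet_otherwise:
            markers.append(c)
    return markers


example = {
    "disease_a": {"g1": 2.0, "g2": 1.5, "g3": 0.1},
    "disease_b": {"g1": -1.8, "g2": 1.2, "g3": 0.0},
    "control": {"g1": 0.1, "g2": 1.4, "g3": 0.2},
}
markers = select_markers(example, {"disease_a", "disease_b"})
```

In this toy data set only g1 qualifies: g2 also changes in the unrelated control perturbation,
and g3 does not change at all.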

The present invention further provides methods for the iterative refinement of
cellular constituent sets and/or clusters of response profiles (such as consensus profiles). In
particular, dominant features in each set of cellular constituents and/or profiles identified by
the cluster analysis methods of this invention may be blanked out, e.g., by setting their
elements to zero or to the mean data value of the set. The blanking out of dominant features
may be done by a user, e.g., by manually selecting features to blank out, or automatically, e.g.,
by automatically blanking out those elements whose response amplitudes are above a
selected threshold. The cluster analysis methods of the invention are then reapplied to the
cellular constituent and/or profile data. Such iterative refinement methods may be used, e.g.,
to identify other potentially interesting but more subtle cellular constituent and/or
experiment associations that were not identified because of the dominant features.
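
The automatic blanking step may be sketched, purely as an example, as follows. The function
name blank_dominant and the list-of-lists data layout are assumptions of the sketch; the
re-clustering step that follows the blanking is not shown.

```python
def blank_dominant(matrix, threshold, fill="zero"):
    """Blank out elements whose response amplitude exceeds a threshold.

    matrix    : list of rows (one per cellular constituent), each a list of
                responses (one per experiment)
    threshold : amplitude above which an element counts as dominant
    fill      : "zero" sets blanked elements to 0.0; "mean" sets them to the
                mean of all values in the data set
    """
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    replacement = 0.0 if fill == "zero" else mean
    # Build a new matrix so the original data are left untouched.
    return [
        [replacement if abs(v) > threshold else v for v in row]
        for row in matrix
    ]


data = [[0.1, 5.0], [-4.0, 0.3]]
blanked_zero = blank_dominant(data, threshold=2.0)
blanked_mean = blank_dominant(data, threshold=2.0, fill="mean")
```

The blanked matrix would then be passed back to the cluster analysis so that weaker
associations are no longer masked by the dominant elements.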

More generally, and as is also apparent to those skilled in the art, the clustering
methods of this invention may be used to cluster each dimension of any N-dimensional array
of biological (or other) data, wherein N may be any positive integer. For example, in some
embodiments, the biological data may comprise matrices (i.e., tables) of values which
describe the change of cellular constituent i in response to perturbation m after a time t. The
clustering methods of the present invention may be used, in such embodiments, to cluster (1)
the cellular constituent index i, (2) the perturbation response index m, and (3) the time index
t. Other embodiments are also apparent to those skilled in the art.

5.4.6. REMOVAL OF PROFILE ARTIFACTS
The projection methods of the present invention, including the methods described in
Section 5.2 above, may also be used to remove unwanted response components (i.e.,
"artifacts") from biological profile data. Frequently, when such profile data are obtained
there are one or more poorly controlled variables which lead to measured patterns of cellular
constituents (e.g., measured gene expression patterns) which are, in fact, artifacts of the
measurement process and are not part of the actual biological state or response (such as a
perturbation response) being measured. Exemplary variables which may produce artifacts in
biological profile data include, but are by no means limited to, cell culture density and
temperature and hybridization temperature, as well as concentrations of total RNA and/or
hybridization reagents.

For example, DeRisi et al. (1997, Science 278:680-686) describe measurements
using microarrays of S. cerevisiae cDNA levels during the change from anaerobic to aerobic
growth (i.e., the "diauxic shift"). However, if one of two nominally identical cell cultures
has unintentionally progressed further into the diauxic shift than the other, their expression
ratios will reflect the transcriptional changes associated with this shift. Such artifacts
potentially confuse the measurements of the true transcriptional responses being sought.
These artifacts may be "projected out" by removing or suppressing their patterns in the data.

In preferred embodiments, the artifact patterns in the data are known. In general,
artifact patterns may be determined from any source of knowledge of the genes and relative
amplitudes of response associated with such artifacts. For example, the artifact patterns may
be derived from experiments with intentional perturbations of the suspected causative
variables. In another embodiment, the artifact patterns may be determined from clustering
analysis of control experiments where the artifacts arise spontaneously.

In such preferred embodiments, the contribution of known artifacts may be solved
for and subtracted from the measured biological profile $p = \{p_i\}$, e.g., by determining the
best scaling coefficients $\alpha_n$ for the contribution of artifact n to the profile. Preferably, the
coefficients are found by determining the values of $\alpha_n$ which minimize an objective
function of the difference between the measured profile and the scaled contribution of the
artifacts. For example, the coefficients may be determined by the least square
minimization

    \min_{\{\alpha_n\}} \sum_i w_i \Bigl( p_i - \sum_n \alpha_n A_{n,i} \Bigr)^2        (16)

wherein $A_{n,i}$ is the amplitude of artifact n on the measurement of cellular constituent i, and
$w_i$ is an optional weighting factor selected by a user according to the relative certainty or
significance of the measured value of cellular constituent i (i.e., of $p_i$).

The "cleaned" profile $p_i^{\mathrm{cleaned}}$, in which the artifacts are effectively removed, is then
given by the equation

    p_i^{\mathrm{cleaned}} = p_i - \sum_n \alpha_n A_{n,i}        (17)

wherein the coefficients $\alpha_n$ are determined, e.g., from equation 16 above.
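
As a non-limiting sketch, a weighted least squares determination of the artifact coefficients,
followed by subtraction of the scaled artifact patterns from the measured profile, might be
programmed as shown below. The function names solve_linear and remove_artifacts are
hypothetical, and solving the normal equations by Gaussian elimination is only one of several
ways to carry out the minimization. Any component of the true response that lies along an
artifact pattern is removed as well, so the artifact patterns should be chosen with that in
mind.

```python
def solve_linear(m, b):
    """Solve m x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("artifact patterns are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def remove_artifacts(p, artifacts, weights=None):
    """Fit coefficients alpha_n by weighted least squares and subtract
    the scaled artifact patterns from the measured profile p.

    p         : measured profile, one value per cellular constituent
    artifacts : list of artifact patterns A_n, each the same length as p
    weights   : optional per-constituent weights w_i (default 1.0)
    """
    w = weights if weights is not None else [1.0] * len(p)
    idx = range(len(p))
    n = len(artifacts)
    # Normal equations: sum_m M[n][m] * alpha_m = rhs[n]
    m = [[sum(w[i] * artifacts[a][i] * artifacts[b][i] for i in idx)
          for b in range(n)] for a in range(n)]
    rhs = [sum(w[i] * artifacts[a][i] * p[i] for i in idx) for a in range(n)]
    alpha = solve_linear(m, rhs)
    cleaned = [p[i] - sum(alpha[a] * artifacts[a][i] for a in range(n))
               for i in idx]
    return alpha, cleaned


# Toy profile: a response on the fourth constituent plus two artifacts.
patterns = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
measured = [2.0, -1.0, 2.0, 3.0]
alpha, cleaned = remove_artifacts(measured, patterns)
```

In this toy case the fitted coefficients are 2 and -1, and the cleaned profile retains only the
response on the fourth constituent.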

In other embodiments, the profile p may be compared to a library of artifact
signatures $A_s = \{A_{s,i}\}$ of different severity. In such embodiments, the "cleaned" profile is
determined by pattern matching against this library to determine the particular template
which has greatest similarity to the profile p. In such embodiments, the cleaned profile is
given by $p_i^{\mathrm{cleaned}} = p_i - A_{s,i}$, wherein the signature $A_s$ is determined, e.g., by solving the
equation

    [equation (18) is missing from the source text]

5.4.7. PROJECTED TITRATION CURVES
In many instances, it may be desirable to measure the response of a biological system
to a plurality of graded levels of exposure to a particular perturbation. For example, during
the process of drug discovery, it is often necessary or desirable to measure the response of a
biological system to graded levels of exposure to a particular drug or drug candidate, e.g., to
determine the therapeutic and/or toxic effects of the drug or drug candidate. In other
instances, it may be desirable to measure the effect on a biological system, e.g., of graded
expression of a particular gene or gene product, such as by the methods described in Section
5.8.1 below. For example, Fig. 13 shows the transcriptional responses of the largest
responding genes of S. cerevisiae to different concentrations of the drug FK506, as described
by Marton et al. (1998, Nature Medicine 4:1293-1301).

The methods of the present invention may also be used to project such "titration
responses" onto co-varying cellular constituent sets, such as onto genesets. Such "titration
responses" typically comprise a plurality of biological responses at graded levels of exposure
to a particular perturbation (e.g., graded levels of exposure to the drug FK506, as illustrated
in Fig. 13). Thus, projected titration responses may be generated by projecting the
biological response profile obtained at each level of the perturbation (e.g., at each
concentration of the drug) according to any of the methods described above in Sections 5.2
and 5.3. For example, Fig. 14 shows the projected titration response curves of Fig. 13. In
this particular example, the projection comprises averaging the response of each geneset
with normalization such that the length of each basis geneset is unity, as described, e.g., in
Section 5.3 above.
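
A minimal sketch of such a projection, applied at each level of a titration, is given below for
illustration. It assumes that each basis geneset is represented by a vector with equal entries
for its member genes and zero elsewhere, scaled to unit length, so that the projected value is
the sum of the member responses divided by the square root of the number of members. The
function names and the dictionary layout are assumptions of the sketch.

```python
import math


def project_profile(profile, genesets):
    """Project one response profile onto unit-length basis genesets.

    profile  : dict mapping gene -> response at one perturbation level
    genesets : dict mapping geneset name -> list of member genes
    """
    return {
        name: sum(profile[g] for g in genes) / math.sqrt(len(genes))
        for name, genes in genesets.items()
    }


def project_titration(titration, genesets):
    """Project the profile measured at each perturbation level.

    titration : dict mapping perturbation level (e.g. drug concentration)
                -> response profile at that level
    """
    return {
        level: project_profile(profile, genesets)
        for level, profile in sorted(titration.items())
    }


genesets = {"A": ["g1", "g2", "g3", "g4"], "B": ["g5"]}
titration = {
    1.0: {"g1": 0.0, "g2": 0.0, "g3": 0.0, "g4": 0.0, "g5": -1.0},
    10.0: {"g1": 1.0, "g2": 1.0, "g3": 1.0, "g4": 1.0, "g5": -2.0},
}
projected = project_titration(titration, genesets)
```

Each geneset then yields a single titration curve (one projected value per perturbation level)
in place of one curve per gene.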

In preferred embodiments, the projected titration responses are interpolated, e.g., by
fitting to some model function of the perturbation. For example, in Fig. 14 the projected
titration response curves have been fit to Hill functions of the form shown in Equation 3
above. However, other model functions known in the art may be used. Alternatively, the
projected titration response curves may be interpolated by means of spline-fitting, wherein
each projected titration curve is interpolated by summing products of an appropriate spline
interpolation function S multiplied by the measured data values, as provided by the equation

    P(u) = \sum_l S(u - u_l) \, P(u_l)        (19)

The variable "u" refers to an arbitrary value of the perturbation (e.g., the drug exposure level
or concentration) where the projected titration response P is to be evaluated. The variable
"$u_l$" refers to discrete values of the perturbation at which response profiles were actually
measured. In general, S may be any smooth, or at least piece-wise continuous, function of
limited support having a width characteristic of the structure expected in the projected
titration response functions. An exemplary width can be chosen to be the distance over
which the projected titration response function being interpolated rises from 10% to 90% of
its asymptotic value. Exemplary S functions include linear and Gaussian interpolation.
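
For illustration, the interpolation sum of equation 19 may be sketched with a linear
(triangular) interpolation function S, as shown below. The names linear_kernel and interpolate
are hypothetical. The sketch assumes perturbation levels measured on a uniform grid (for
example, equally spaced in log concentration) with the kernel width set to the grid spacing, in
which case the sum reproduces ordinary linear interpolation between measured points.

```python
def linear_kernel(width):
    """Return a triangular function S of limited support: S(0) = 1,
    falling linearly to 0 at a distance of `width` on either side."""
    def s(x):
        return max(0.0, 1.0 - abs(x) / width)
    return s


def interpolate(u, measured_u, measured_p, kernel):
    """Evaluate P(u) = sum over l of S(u - u_l) * P(u_l)."""
    return sum(kernel(u - ul) * pl for ul, pl in zip(measured_u, measured_p))


levels = [0.0, 1.0, 2.0]      # measured perturbation levels u_l
responses = [0.0, 1.0, 4.0]   # projected responses P(u_l)
s = linear_kernel(width=1.0)
at_node = interpolate(1.0, levels, responses, s)
between = interpolate(1.5, levels, responses, s)
```

At a measured level the sum returns the measured value, and midway between two measured
levels it returns their average. A Gaussian S would instead smooth across neighboring points
and would generally need to be normalized.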

Compared to the confusing tangle of curves in Fig. 13, it is clear from the projected
geneset titration responses shown in Fig. 14 that certain genesets respond at different critical
concentrations of FK506 (given by $u_0$ in Equation 3), and with different power law exponent
(n in Equation 3) than do other genesets. Fig. 15 shows the contours of chi-squared plotted
around the values of the two Hill coefficients ($u_0$ and n in Equation 3) derived for each
geneset. The plot shows that the apparent visual distinctions in Fig. 14 are statistically
significant. Specifically, the Hill coefficients are distinguished in both their sharpness (i.e.,
the power law exponent n, vertical axis) and in their critical concentrations (i.e., $u_0$,
horizontal axis). Thus, individual genesets may be distinguished, e.g., according to the form
of their titration responses.
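
The roles of the critical concentration and the power law exponent can be illustrated with a
short sketch. Equation 3 is not reproduced in this section, so the code below assumes the
conventional Hill form, amplitude * (u/u0)^n / (1 + (u/u0)^n); the function name hill and the
amplitude parameter are assumptions of the sketch and may differ from the form actually used.

```python
def hill(u, amplitude, u0, n):
    """Assumed Hill function: half-maximal response at u = u0, with
    larger exponents n giving a sharper transition around u0."""
    ratio = (u / u0) ** n
    return amplitude * ratio / (1.0 + ratio)


half_max = hill(1.0, amplitude=2.0, u0=1.0, n=3)
gentle = hill(2.0, amplitude=1.0, u0=1.0, n=1)
sharp = hill(2.0, amplitude=1.0, u0=1.0, n=4)
```

Under this assumed form, two genesets with the same critical concentration but different
exponents reach different fractions of their asymptotic response at the same drug level, which
is the kind of distinction described above.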

As expected, the different genesets in a titration response profile are also biologically
significant. For example, supporting experiments using FK506 in gene deletion strains of S.
cerevisiae and the analysis of gene regulatory sequence regions show that the genesets
identified in Fig. 14 for the titration response of S. cerevisiae to FK506 have biological
identities (see Marton et al., supra). These identities are indicated by the annotations in Fig.
14. Thus, the titration behaviors of different genesets are also indicative of different
biological pathways. For example, the curves labeled "GCN4-dependent" in Fig. 14 are
responses of the sets of genes whose responses are mediated via the transcription factor
protein Gcn4 (see Marton et al., supra), while the gentler responses in Fig. 14, labeled
"GCN4-independent", are for the sets of genes which respond to FK506 whether or not the
calcineurin or Gcn4 proteins are present.

In other instances, it may be desirable to measure the state of a biological sample
over a time interval. In particular, it is often desirable to monitor the changing biological
state of a sample that occurs over time, e.g., in association with a particular biological
process or effect. Such biological processes may include, but are by no means limited to,
meiosis, mitosis, and cell differentiation. Changes in the biological state of a sample that
occur over a time interval may also include changes in response to a particular perturbation
such as exposure to one or more drugs, or a change in the environment. Monitoring changes
of the biological state of a sample over time may simply comprise a plurality of
measurements over the time interval during which the biological process or effect of interest
occurs. The methods of the present invention may be used to project such "temporal
measurements" of the biological state onto co-varying cellular constituent sets such as onto
genesets. In particular, as is apparent to those skilled in the art, such temporal measurements
may be analyzed according to the methods described above for measuring titration
responses.
5.4.8. USE OF GENESETS IN MICROARRAYS
The genesets of the present invention are also useful in the design and preparation of
microarrays. In particular, using the methods of the invention a skilled artisan can readily
select and prepare probes for a microarray wherein the microarray contains specific
individual probes for less than all the genes in the genome and less than all the genes in a
geneset. In such embodiments, the microarray contains one or two or more individual
probes, each of which hybridizes to an expression product (e.g., mRNA, or cDNA or cRNA
derived therefrom) within a single geneset for a desired number of genesets. Thus, for
example, changes in the expression of all or most of the genes in the entire genome of a cell
or organism can thereby be monitored by use of a surrogate and on a single microarray by
measuring expression of the group of genesets that are representative of all or most of the
genes of the genome. Such microarrays can be prepared, e.g., as described in Section 5.7,
below, using the selected probes and are therefore part of the present invention.

For example, in preferred embodiments, genesets are identified, as described in the
above sections, for a biological sample (e.g., a cell or organism) of interest. In general, the
number of genesets identified and for which probes appear in a microarray can be anywhere
from 50 to 1,000. Preferably, however, the number of genesets for which probes appear in a
microarray will be fewer than 500, more preferably from 100 to 500, and still more
preferably from 100 to 200. Representative genes are then selected from each geneset
identified, and probes are prepared that hybridize to the nucleotide sequence of each
representative gene. Preferably, no more than ten representative genes are selected from
each geneset. More preferably, however, the number of representative genes selected from
each geneset for which probes appear on the microarray is no more than five, no more than
four, no more than three or no more than two. In fact, most preferably only a single
representative gene is selected from each geneset for which one or more probes appear on
the microarray. For at least one geneset, and preferably for most or all of the genesets, the
number of representative genes for which probes appear on the microarray is less than the
total number of genes in the geneset. In certain preferred embodiments, at least one
representative gene for which probes appear on the microarray is selected from all of the
genesets identified for the cell or organism. In other embodiments, the representative genes
for which probes appear on the microarray are selected solely from genesets that are
associated with one or more particular biological states of interest. For example, in certain
embodiments, the representative genes are selected from genesets associated with a
particular disease or disease state. In other embodiments, the representative genes are
selected from genesets whose change in expression is associated with a particular drug or
with a particular therapy including, for example, genesets whose change in expression is
associated with drug or therapeutic efficacy or genesets whose change in expression is
associated with drug resistance or therapeutic failure. Thus, for example, in certain
embodiments the total number of genesets for which probes are present on a microarray is
less than 1,000, less than 500, less than 200, less than 100, less than 50, less than 30, less
than 20, or less than 10.
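
By way of example only, the selection of a limited number of representative genes from each
geneset might be sketched as follows. The text above does not prescribe how representatives
are chosen; ranking the genes of each geneset by a per-gene score (for instance, typical
response amplitude) is purely an assumption of this sketch, as are the names
pick_representatives and max_per_set.

```python
def pick_representatives(genesets, score, max_per_set=1):
    """Choose up to max_per_set representative genes from each geneset.

    genesets    : dict mapping geneset name -> list of member genes
    score       : dict mapping gene -> ranking score (assumed criterion;
                  higher scores are preferred)
    max_per_set : upper limit on representatives per geneset
    """
    chosen = {}
    for name, genes in genesets.items():
        # Sort by descending score, breaking ties by gene name.
        ranked = sorted(genes, key=lambda g: (-score[g], g))
        chosen[name] = ranked[:max_per_set]
    return chosen


genesets = {"A": ["g1", "g2", "g3"], "B": ["g4"]}
score = {"g1": 0.2, "g2": 0.9, "g3": 0.5, "g4": 0.1}
single = pick_representatives(genesets, score)
pair = pick_representatives(genesets, score, max_per_set=2)
```

Probes would then be prepared only for the chosen genes, so that the number of probes on the
array scales with the number of genesets and not with the number of genes.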

5.5. COMPUTER IMPLEMENTATION
The analytic methods described in the previous subsections can preferably be
implemented by use of the following computer systems and according to the following
programs and methods. FIG. 5 illustrates an exemplary computer system suitable for
implementation of the analytic methods of this invention. Computer system 501 is
illustrated as comprising internal components and being linked to external components. The
internal components of this computer system include processor element 502 interconnected
with main memory 503. For example, computer system 501 can be an Intel Pentium®-
based processor of 200 MHz or greater clock rate and with 32 MB or more of main memory.

The external components include mass storage 504. This mass storage can be one or
more hard disks (which are typically packaged together with the processor and memory).
Such hard disks are typically of 1 GB or greater storage capacity. Other external
components include user interface device 505, which can be a monitor, together with
inputting device 506, which can be a "mouse", or other graphic input devices (not illustrated),
and/or a keyboard. A printing device 508 can also be attached to the computer 501.

Typically, computer system 501 is also linked to network link 507, which can be part
of an Ethernet link to other local computer systems, remote computer systems, or wide area
communication networks, such as the Internet. This network link allows computer system
501 to share data and processing tasks with other computer systems.

Loaded into memory during operation of this system are several software
components, which are both standard in the art and special to the instant invention. These
software components collectively cause the computer system to function according to the
methods of this invention. These software components are typically stored on mass storage
504. Software component 510 represents the operating system, which is responsible for
managing computer system 501 and its network interconnections. This operating system can
be, for example, of the Microsoft Windows family, such as Windows 95, Windows 98, or
Windows NT. Software component 511 represents common languages and functions
conveniently present on this system to assist programs implementing the methods specific to
this invention. Many high or low level computer languages can be used to program the
analytic methods of this invention. Instructions can be interpreted during run-time or
compiled. Preferred languages include C, C++, FORTRAN and JAVA®. Most preferably,
the methods of this invention are programmed in mathematical software packages which
allow symbolic entry of equations and high-level specification of processing, including
algorithms to be used, thereby freeing a user of the need to procedurally program individual
equations or algorithms. Such packages include Matlab from MathWorks (Natick, MA),
Mathematica from Wolfram Research (Champaign, IL), or S-Plus from MathSoft
(Cambridge, MA). Accordingly, software component 512 represents the analytic methods of
this invention as programmed in a procedural language or symbolic package. In a preferred
embodiment, the computer system also contains a database 513 of perturbation response
curves.

In an exemplary implementation, to practice the methods of the present invention, a
user first loads expression profile data into the computer system 501. These data can be
directly entered by the user from monitor 505 and keyboard 506, or from other computer
systems linked by network connection 507, or on removable storage media such as a CD-ROM
or floppy disk (not illustrated) or through the network (507). Next the user causes
execution of expression profile analysis software 512 which performs the steps of clustering
co-varying genes into genesets.

In another exemplary implementation, a user first loads expression profile data into
the computer system. Geneset profile definitions are loaded into the memory from the
storage media (504) or from a remote computer, preferably from a dynamic geneset database
system, through the network (507). Next the user causes execution of projection software
which performs the steps of converting expression profiles to projected expression profiles.

In yet another exemplary implementation, a user first loads a projected profile into
the memory. The user then causes the loading of a reference profile into the memory. Next,
the user causes the execution of comparison software which performs the steps of
objectively comparing the profiles.

This invention also provides software for geneset definition, projection, and analysis
for projected profiles. One embodiment of the software contains a module capable of
executing the cluster analysis of the invention. The module is capable of causing a processor
of a computer system to execute steps of (a) receiving a perturbation experiment data table,
(b) receiving the criteria for geneset selection, (c) clustering the perturbation data into a
clustering tree, and (d) defining genesets based upon the clustering tree and the criteria for
geneset selection.
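
Steps (a) through (d) may be sketched, for illustration only, as a small agglomerative
clustering routine. The sketch assumes average-linkage clustering with a distance of one minus
the correlation coefficient, and assumes that the geneset selection criterion is a maximum
merge distance at which the clustering tree is cut; neither choice is dictated by the module
described above, and the names correlation and define_genesets are hypothetical.

```python
import math


def correlation(x, y):
    """Pearson correlation of two equal-length response vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def define_genesets(table, max_distance):
    """(a) receive a data table, (b) receive a selection criterion,
    (c) build the tree by repeated merging, (d) stop merging once the
    closest pair of clusters is farther apart than max_distance.

    table        : dict mapping gene -> responses across perturbations
    max_distance : assumed criterion, a cut level on the clustering tree
    """
    clusters = [[g] for g in sorted(table)]

    def distance(c1, c2):
        # Average linkage over all cross-cluster gene pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(1.0 - correlation(table[a], table[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        d, i, j = min(
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


table = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.0, 2.5],
}
genesets = define_genesets(table, max_distance=0.5)
```

In this toy table g1 and g2 rise together while g3 and g4 fall together, so two genesets are
defined.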

Another embodiment of the software contains a module capable of executing the
projection operation by causing a processor of a computer system to execute steps of (a)
receiving a geneset definition, (b) receiving an expression profile, and (c) calculating a
projected profile based upon the geneset definition and the expression profile.

Yet another embodiment of the software contains a module capable of executing the
comparison operation by causing a processor of a computer system to execute steps of
(a) receiving a projected profile of a biological sample, (b) receiving a reference profile, and
(c) calculating an objective measurement of the similarity between the two profiles.
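
The comparison operation may be sketched as follows. The objective measure of similarity used
here, a normalized dot product over the genesets common to both profiles, is one possible
choice and is assumed for illustration; the function name profile_similarity is likewise
hypothetical.

```python
import math


def profile_similarity(projected, reference):
    """Normalized dot product of two projected profiles.

    projected, reference : dicts mapping geneset name -> projected value.
    Returns a value between -1 and 1, or 0.0 if either profile has no
    signal on the shared genesets.
    """
    shared = sorted(set(projected) & set(reference))
    dot = sum(projected[k] * reference[k] for k in shared)
    norm_p = math.sqrt(sum(projected[k] ** 2 for k in shared))
    norm_r = math.sqrt(sum(reference[k] ** 2 for k in shared))
    if norm_p == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_p * norm_r)


sample = {"A": 2.0, "B": -1.0}
same_shape = {"A": 4.0, "B": -2.0}
unrelated = {"A": 1.0, "B": 2.0}
s_same = profile_similarity(sample, same_shape)
s_unrelated = profile_similarity(sample, unrelated)
```

A value near 1 indicates that the sample and reference respond in the same genesets with the
same relative amplitudes, regardless of overall response magnitude.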

Alternative computer systems and software for implementing the analytic methods of
this invention will be apparent to one of skill in the art and are intended to be comprehended
within the accompanying claims. In particular, the accompanying claims are intended to
include the alternative program structures for implementing the methods of this invention
that will be readily apparent to one of skill in the art.

5.6. ANALYTIC KIT IMPLEMENTATION
In a preferred embodiment, the methods of this invention can be implemented by use
of kits for determining the responses or state of a biological sample. Such kits contain
microarrays, such as those described in Subsections below. The microarrays contained in
such kits comprise a solid phase, e.g., a surface, to which probes are hybridized or bound at a
known location of the solid phase. Preferably, these probes consist of nucleic acids of
known, different sequence, with each nucleic acid being capable of hybridizing to an RNA
species or to a cDNA species derived therefrom. In particular, the probes contained in the
kits of this invention are nucleic acids capable of hybridizing specifically to nucleic acid
sequences derived from RNA species which are known to increase or decrease in response to
perturbations to the particular protein whose activity is determined by the kit. The probes
contained in the kits of this invention preferably substantially exclude nucleic acids which
hybridize to RNA species that are not increased in response to perturbations to the particular
protein whose activity is determined by the kit.

In a preferred embodiment, a kit of the invention also contains a database of geneset
definitions such as the databases described above or an access authorization to use the
database described above from a remote networked computer.

In another preferred embodiment, a kit of the invention further contains expression
profile projection and analysis software capable of being loaded into the memory of a
computer system such as the one described in the subsection supra and illustrated in FIG. 5.
The expression profile analysis software contained in the kit of this invention is essentially
identical to the expression profile analysis software 512 described above.

Alternative kits for implementing the analytic methods of this invention will be
apparent to one of skill in the art and are intended to be comprehended within the
accompanying claims. In particular, the accompanying claims are intended to include the
alternative program structures for implementing the methods of this invention that will be
readily apparent to one of skill in the art.

5.7. METHODS FOR DETERMINING BIOLOGICAL RESPONSE

This invention utilizes the ability to measure the responses of a biological system to a 
large variety of perturbations. This section provides some exemplary methods for measuring 
biological responses. One of skill in the art would appreciate that this invention is not 
limited to the following specific methods for measuring the responses of a biological system. 

5.7.1. TRANSCRIPT ASSAY USING DNA ARRAY
This invention is particularly useful for the analysis of gene expression profiles. One
aspect of the invention provides methods for defining co-regulated genesets based upon the
correlation of gene expression. Some embodiments of this invention are based on measuring
the transcriptional rate of genes.

The transcriptional rate can be measured by techniques of hybridization to arrays of
nucleic acid or nucleic acid mimic probes, described in the next subsection, or by other gene
expression technologies, such as those described in the subsequent subsection. However
measured, the result is either the absolute or relative amounts of transcripts, or response data
including values representing RNA abundance ratios, which usually reflect DNA expression
ratios (in the absence of differences in RNA degradation rates).

In various alternative embodiments of the present invention, aspects of the biological
state other than the transcriptional state, such as the translational state, the activity state, or
mixed aspects can be measured.

Preferably, measurement of the transcriptional state is made by hybridization to
transcript arrays, which are described in this subsection. Certain other methods of
transcriptional state measurement are described later in this subsection.

In a preferred embodiment the present invention makes use of "transcript arrays"
(also called herein "microarrays"). Transcript arrays can be employed for analyzing the
transcriptional state in a biological sample and especially for measuring the transcriptional
states of a biological sample exposed to graded levels of a drug of interest or to graded
perturbations to a biological pathway of interest.

In one embodiment, transcript arrays are produced by hybridizing detectably labeled
polynucleotides representing the mRNA transcripts present in a cell (e.g., fluorescently
labeled cDNA synthesized from total cell mRNA) to a microarray. A microarray is a surface
with an ordered array of binding (e.g., hybridization) sites for products of many of the genes
in the genome of a cell or organism, preferably most or almost all of the genes. Microarrays
can be made in a number of ways, of which several are described hereinbelow. However
produced, microarrays share certain characteristics: The arrays are reproducible, allowing
multiple copies of a given array to be produced and easily compared with each other.
Preferably, the microarrays are made from materials that are stable under binding (e.g.,
nucleic acid hybridization) conditions. The microarrays are preferably small, e.g., between
about 5 cm² and 25 cm², preferably about 12 to 13 cm². However, both larger and smaller
arrays are also contemplated and may be preferable, e.g., for simultaneously evaluating a
very large number of different probes.

Preferably, a given binding site or unique set of binding sites in the microarray will
specifically bind (e.g., hybridize) to the product of a single gene or gene transcript from a
cell or organism (e.g., to a specific mRNA or to a specific cDNA derived therefrom).
However, as discussed above, in general other, related or similar sequences will
cross-hybridize to a given binding site.

The microarrays used in the methods and compositions of the present invention
include one or more test probes, each of which has a polynucleotide sequence that is
complementary to a subsequence of RNA or DNA to be detected. Each probe preferably has
a different nucleic acid sequence, and the position of each probe on the solid surface of the
array is preferably known. Indeed, the microarrays are preferably addressable arrays, more
preferably positionally addressable arrays. More specifically, each probe of the array is
preferably located at a known, predetermined position on the solid support such that the
identity (i.e., the sequence) of each probe can be determined from its position on the array
(i.e., on the support or surface).

Preferably, the density of probes on a microarray is about 100 different (i.e.,
non-identical) probes per 1 cm² or higher. More preferably, a microarray used in the methods of
the invention will have at least 550 probes per 1 cm², at least 1,000 probes per 1 cm², at least
1,500 probes per 1 cm² or at least 2,000 probes per 1 cm². In a particularly preferred
embodiment, the microarray is a high density array, preferably having a density of at least
about 2,500 different probes per 1 cm². The microarrays used in the invention therefore
preferably contain at least 2,500, at least 5,000, at least 10,000, at least 15,000, at least
20,000, at least 25,000, at least 50,000 or at least 55,000 different (i.e., non-identical)
probes.

In one embodiment, the microarray is an array (i.e., a matrix) in which each position
represents a discrete binding site for a product encoded by a gene (i.e., for an mRNA or for a
cDNA derived therefrom). For example, in various embodiments, the microarrays of the
invention can comprise binding sites for products encoded by fewer than 50% of the genes
in the genome of an organism. Alternatively, the microarrays of the invention can have
binding sites for the products encoded by at least 50%, at least 75%, at least 85%, at least
90%, at least 95%, at least 99% or 100% of the genes in the genome of an organism or,
alternatively, for representative genes of genesets encompassing the foregoing percentages
of genes in the genome. In other embodiments, the microarrays of the invention can have
binding sites for products encoded by fewer than 50%, by at least 50%, by at least 75%, by
at least 85%, by at least 90%, by at least 95%, by at least 99% or by 100% of the genes
expressed by a cell of an organism or, alternatively, for representative genes of genesets
encompassing the foregoing percentages of genes in the genome. The binding site can be a
DNA or DNA analog to which a particular RNA can specifically hybridize. The DNA or
DNA analog can be, e.g., a synthetic oligomer, a full-length cDNA, a less-than-full-length
cDNA, or a gene fragment.

Preferably, the microarrays used in the invention have binding sites (i.e., probes) for
one or more genes relevant to the action of a drug of interest or in a biological pathway of
interest. A "gene" is identified as an open reading frame (ORF) that encodes a sequence of
preferably at least 50, 75, or 99 amino acid residues from which a messenger RNA is
transcribed in the organism or in some cell or cells of a multicellular organism. The number
of genes in a genome can be estimated from the number of mRNAs expressed by the cell or
organism, or by extrapolation of a well characterized portion of the genome. When the
genome of the organism of interest has been sequenced, the number of ORFs can be
determined and mRNA coding regions identified by analysis of the DNA sequence. For
example, the genome of Saccharomyces cerevisiae has been completely sequenced and is
reported to have approximately 6275 ORFs encoding sequences longer than 99 amino acid
residues in length. Analysis of these ORFs indicates that there are 5,885 ORFs that are
likely to encode protein products (Goffeau et al., 1996, Science 274:546-567). In contrast,
the human genome is estimated to contain approximately 10⁵ genes.

It will be appreciated that when cDNA complementary to the RNA of a cell is made
and hybridized to a microarray under suitable hybridization conditions, the level of
hybridization to the site in the array corresponding to any particular gene will reflect the
prevalence in the cell of mRNA transcribed from that gene. For example, when detectably
labeled (e.g., with a fluorophore) cDNA complementary to the total cellular mRNA is
hybridized to a microarray, the site on the array corresponding to a gene (i.e., capable of
specifically binding the product of the gene) that is not transcribed in the cell will have little
or no signal (e.g., fluorescent signal), and a gene for which the encoded mRNA is prevalent
will have a relatively strong signal.

In preferred embodiments, cDNAs from two different cells are hybridized to the
binding sites of the microarray. In the case of drug responses, one biological sample is
exposed to a drug and another biological sample of the same type is not exposed to the drug.
In the case of pathway responses, one cell is exposed to a pathway perturbation and another
cell of the same type is not exposed to the pathway perturbation. The cDNAs derived from
each of the two cell types are differently labeled so that they can be distinguished. In one
embodiment, for example, cDNA from a cell treated with a drug (or exposed to a pathway
perturbation) is synthesized using a fluorescein-labeled dNTP, and cDNA from a second
cell, not drug-exposed, is synthesized using a rhodamine-labeled dNTP. When the two
cDNAs are mixed and hybridized to the microarray, the relative intensity of signal from each
cDNA set is determined for each site on the array, and any relative difference in abundance
of a particular mRNA is detected.

In the example described above, the cDNA from the drug-treated (or pathway
perturbed) cell will fluoresce green when the fluorophore is stimulated and the cDNA from
the untreated cell will fluoresce red. As a result, when the drug treatment has no effect,
either directly or indirectly, on the relative abundance of a particular mRNA in a cell, the
mRNA will be equally prevalent in both cells and, upon reverse transcription, red-labeled
and green-labeled cDNA will be equally prevalent. When hybridized to the microarray, the
binding site(s) for that species of RNA will emit wavelengths characteristic of both
fluorophores (and appear brown in combination). In contrast, when the drug-exposed cell is
treated with a drug that, directly or indirectly, increases the prevalence of the mRNA in the
cell, the ratio of green to red fluorescence will increase. When the drug decreases the
mRNA prevalence, the ratio will decrease.
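
By way of illustration only, the green-to-red comparison described above can be sketched in
Python. The function names, the optional background arguments and the intensity floor are
assumptions introduced for this sketch; they are not part of the cited methods, and a real
scanner pipeline would supply its own background and normalization steps.

```python
import math


def log2_ratio(green, red, background_green=0.0, background_red=0.0, floor=1.0):
    """Return log2(green / red) for one array site.

    green: signal from the fluorescein-labeled cDNA (drug-treated cell).
    red:   signal from the rhodamine-labeled cDNA (untreated cell).
    floor: hypothetical minimum intensity, used only to avoid division by zero.
    """
    g = max(green - background_green, floor)
    r = max(red - background_red, floor)
    return math.log2(g / r)


def direction(green, red, tolerance=0.0):
    """Classify the mRNA as 'up', 'down' or 'unchanged' in the treated cell."""
    value = log2_ratio(green, red)
    if value > tolerance:
        return "up"
    if value < -tolerance:
        return "down"
    return "unchanged"
```

A positive log ratio corresponds to an increased green-to-red ratio (the drug raises the
prevalence of the mRNA), a negative value to a decreased ratio, and a value near zero to a site
that emits both wavelengths about equally.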

The use of a two-color fluorescence labeling and detection scheme to define
alterations in gene expression has been described, e.g., in Schena et al., 1995, Quantitative
monitoring of gene expression patterns with a complementary DNA microarray, Science
270:467-470, which is incorporated by reference in its entirety for all purposes. An
advantage of using cDNA labeled with two different fluorophores is that a direct and
internally controlled comparison of the mRNA levels corresponding to each arrayed gene in
two cell states can be made, and variations due to minor differences in experimental
conditions (e.g., hybridization conditions) will not affect subsequent analyses. However, it
will be recognized that it is also possible to use cDNA from a single cell, and compare, for
example, the absolute amount of a particular mRNA in, e.g., a drug-treated or
pathway-perturbed cell and an untreated cell.

5.7.1.1. PREPARING NUCLEIC ACIDS FOR MICROARRAYS
As noted above, the "binding site" to which a particular cognate cDNA specifically
hybridizes is usually a nucleic acid or nucleic acid analogue attached at that binding site. In
one embodiment, the binding sites of the microarray are DNA polynucleotides
corresponding to at least a portion of each gene in an organism's genome. These DNAs can
be obtained by, e.g., polymerase chain reaction (PCR) amplification of gene segments from
genomic DNA, cDNA (e.g., by RT-PCR), or cloned sequences. PCR primers are chosen,
based on the known sequence of the genes or cDNA, that result in amplification of unique
fragments (i.e., fragments that do not share more than 10 bases of contiguous identical
sequence with any other fragment on the microarray). Computer programs are useful in the
design of primers with the required specificity and optimal amplification properties. See,
e.g., Oligo version 5.0 (National Biosciences). In the case of binding sites corresponding to
very long genes, it will sometimes be desirable to amplify segments near the 3' end of the
gene so that when oligo-dT primed cDNA probes are hybridized to the microarray,
less-than-full length probes will bind efficiently. Typically each gene fragment on the
microarray will be between 50 bp and 50,000 bp, between 50 bp and 2000 bp, more typically
between 100 bp and 1000 bp, and usually between 300 bp and 800 bp in length. PCR
methods are well known and are described, for example, in Innis et al. eds., 1990, PCR
Protocols: A Guide to Methods and Applications, Academic Press Inc., San Diego, CA,
which is incorporated by reference in its entirety for all purposes. It will be apparent that
computer controlled robotic systems are useful for isolating and amplifying nucleic acids.
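
As a non-limiting sketch, the uniqueness criterion stated above (no more than 10 bases of
contiguous identical sequence shared with any other fragment) can be checked in Python as
follows. The function names are hypothetical, and the sketch compares sequences on one strand
only; whether reverse complements should also be screened is an assumption left to the user.

```python
def longest_shared_run(a, b):
    """Length of the longest contiguous identical stretch shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous base of both sequences.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def is_unique(candidate, existing, max_shared=10):
    """True if candidate shares at most max_shared contiguous bases with every fragment."""
    return all(longest_shared_run(candidate, other) <= max_shared for other in existing)
```

The comparison is quadratic in fragment length, which is adequate for a handful of fragments;
a genome-scale design would more plausibly rely on an indexed search.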

An alternative, preferred means for generating the polynucleotide probes for a
microarray used in the methods and compositions of the invention is by synthesis of
synthetic polynucleotides or oligonucleotides, e.g., using H-phosphonate or phosphoramidite
chemistries (Froehler et al., 1986, Nucleic Acid Res. 14:5399-5407; McBride et al., 1983,
Tetrahedron Lett. 24:246-248). Synthetic sequences are typically between 4 and 500 bases
in length, between 15 and 500 bases in length, more typically between 4 and 200 bases in
length, even more preferably between 15 and 150 bases in length and still more preferably
between 20 and 50 bases in length. In embodiments wherein shorter oligonucleotide probes
are used, synthetic nucleic acid sequences less than 40 bases in length are preferred, more
preferably between 15 and 30 bases in length. In embodiments wherein longer
oligonucleotide probes are used, synthetic nucleic acid sequences are preferably between 40
and 80 bases in length, more preferably between 40 and 70 bases in length and even more
preferably between 50 and 60 bases in length. In some embodiments, synthetic nucleic acids
include non-natural bases, such as, but not limited to, inosine. As noted above, nucleic acid
analogs may be used as binding sites for hybridization. An example of a suitable nucleic
acid analog is peptide nucleic acid (see, e.g., Egholm et al., 1993, Nature 365:566-568; U.S.
Patent No. 5,539,083).

In an alternative embodiment, the binding (hybridization) sites are made from
plasmid or phage clones of genes, cDNAs (e.g., expressed sequence tags), or inserts
therefrom (Nguyen et al., 1995, Differential gene expression in the murine thymus assayed
by quantitative hybridization of arrayed cDNA clones, Genomics 29:207-209). In yet
another embodiment, the polynucleotide of the binding sites is RNA.

5.7.1.2. ATTACHING NUCLEIC ACIDS TO THE SOLID SURFACE

The nucleic acids or analogues are attached to a solid support, which may be made
from glass, plastic (e.g., polypropylene, nylon), polyacrylamide, nitrocellulose, or other
materials. A preferred method for attaching the nucleic acids to a surface is by printing on
glass plates, as is described generally by Schena et al., 1995, Quantitative monitoring of
gene expression patterns with a complementary DNA microarray, Science 270:467-470.
This method is especially useful for preparing microarrays of cDNA. See also DeRisi et al.,
1996, Use of a cDNA microarray to analyze gene expression patterns in human cancer,
Nature Genetics 14:457-460; Shalon et al., 1996, A DNA microarray system for analyzing
complex DNA samples using two-color fluorescent probe hybridization, Genome Res.
6:639-645; and Schena et al., 1995, Parallel human genome analysis; microarray-based
expression of 1000 genes, Proc. Natl. Acad. Sci. USA 93:10539-11286.

A second preferred method for making microarrays is by making high-density
oligonucleotide arrays. Techniques are known for producing arrays containing thousands of
oligonucleotides complementary to defined sequences, at defined locations on a surface
using photolithographic techniques for synthesis in situ (see, Fodor et al., 1991,
Light-directed spatially addressable parallel chemical synthesis, Science 251:767-773; Pease et al.,
1994, Light-directed oligonucleotide arrays for rapid DNA sequence analysis, Proc. Natl.
Acad. Sci. USA 91:5022-5026; Lockhart et al., 1996, Expression monitoring by
hybridization to high-density oligonucleotide arrays, Nature Biotech 14:1675; U.S. Patent
Nos. 5,578,832; 5,556,752; and 5,510,270, each of which is incorporated by reference in its
entirety for all purposes) or other methods for rapid synthesis and deposition of defined
oligonucleotides (Blanchard et al., 1996, High-Density Oligonucleotide arrays, Biosensors
& Bioelectronics 11:687-90). When these methods are used, oligonucleotides (e.g.,
20-mers) of known sequence are synthesized directly on a surface such as a derivatized glass
slide. Usually, the array produced contains multiple probes against each target transcript.
Oligonucleotide probes can be chosen to detect alternatively spliced mRNAs or to serve as
various types of control.



WO 00/24936    PCT/US99/25025



Another preferred method of making microarrays is by use of an inkjet printing 
process to synthesize oligonucleotides directly on a solid phase, as described, e.g., in 
co-pending U.S. patent application Serial No. 09/008,120 filed on January 16, 1998, by 
Blanchard entitled "Chemical Synthesis Using Solvent Microdroplets", which is
incorporated by reference herein in its entirety.

Other methods for making microarrays, e.g., by masking (Maskos and Southern,
1992, Nuc. Acids Res. 20:1679-1684), may also be used. In principle, any type of array, for
example, dot blots on a nylon hybridization membrane (see Sambrook et al., Molecular
Cloning - A Laboratory Manual (2nd Ed.), Vol. 1-3, Cold Spring Harbor Laboratory, Cold
Spring Harbor, New York, 1989), could be used, although, as will be recognized by those of
skill in the art, very small arrays will be preferred because hybridization volumes will be
smaller.

In a particularly preferred embodiment, microarrays used in the invention are
manufactured by means of an ink jet printing device for oligonucleotide synthesis, e.g.,
using the methods and systems described by Blanchard in International Patent Publication
No. WO 98/41531, published on September 24, 1998; Blanchard et al., 1996, Biosensors
and Bioelectronics 11:687-690; Blanchard, 1998, in Synthetic DNA Arrays in Genetic
Engineering, Vol. 20, J.K. Setlow, ed., Plenum Press, New York at pages 111-123.
Specifically, the oligonucleotide probes in such microarrays are preferably synthesized by
serially depositing individual nucleotides for each probe sequence in an array of
"microdroplets" of a high surface tension solvent such as propylene carbonate. The
microdroplets have small volumes (e.g., 100 pL or less, more preferably 50 pL or less) and
are separated from each other on the microarray (e.g., by hydrophobic domains) to form
circular surface tension wells which define the locations of the array elements (i.e., the
different probes).

5.7.1.3. TARGET POLYNUCLEOTIDE MOLECULES 
Methods for preparing total and poly(A)+ RNA are well known and are described
generally in Sambrook et al., supra. In one embodiment, RNA is extracted from cells of the
various types of interest in this invention using guanidinium thiocyanate lysis followed by
CsCl centrifugation (Chirgwin et al., 1979, Biochemistry 18:5294-5299). Poly(A)+ RNA is
selected with oligo-dT cellulose (see Sambrook et al., supra). Cells of interest
include wild-type cells, drug-exposed wild-type cells, modified cells, and drug-exposed
modified cells.

Labeled cDNA is prepared from mRNA by oligo dT-primed or random-primed
reverse transcription, both of which are well known in the art (see, e.g., Klug and Berger,
1987, Methods Enzymol. 152:316-325). Reverse transcription may be carried out in the
presence of a dNTP conjugated to a detectable label, most preferably a fluorescently labeled
dNTP. Alternatively, isolated mRNA can be converted to labeled antisense RNA
synthesized by in vitro transcription of double-stranded cDNA in the presence of labeled
dNTPs (Lockhart et al., 1996, Expression monitoring by hybridization to high-density
oligonucleotide arrays, Nature Biotech. 14:1675, which is incorporated by reference in its
entirety for all purposes). In alternative embodiments, the cDNA or RNA probe can be
synthesized in the absence of detectable label and may be labeled subsequently, e.g., by
incorporating biotinylated dNTPs or rNTP, or some similar means (e.g., photo-cross-linking
a psoralen derivative of biotin to RNAs), followed by addition of labeled streptavidin
(e.g., phycoerythrin-conjugated streptavidin) or the equivalent.

When fluorescently-labeled probes are used, many suitable fluorophores are known,
including fluorescein, lissamine, phycoerythrin, rhodamine (Perkin Elmer Cetus), Cy2, Cy3,
Cy3.5, Cy5, Cy5.5, Cy7, FluorX (Amersham) and others (see, e.g., Kricka, 1992,
Nonisotopic DNA Probe Techniques, Academic Press San Diego, CA). It will be
appreciated that pairs of fluorophores are chosen that have distinct emission spectra so that
they can be easily distinguished.

In another embodiment, a label other than a fluorescent label is used. For example, a
radioactive label, or a pair of radioactive labels with distinct emission spectra, can be used
(see Zhao et al., 1995, High density cDNA filter analysis: a novel approach for large-scale,
quantitative analysis of gene expression, Gene 156:207; Pietu et al., 1996, Novel gene
transcripts preferentially expressed in human muscles revealed by quantitative hybridization
of a high density cDNA array, Genome Res. 6:492). However, because of scattering of
radioactive particles, and the consequent requirement for widely spaced binding sites, use of
radioisotopes is a less-preferred embodiment.

In one embodiment, labeled cDNA is synthesized by incubating a mixture containing
0.5 mM dGTP, dATP and dCTP plus 0.1 mM dTTP plus fluorescent deoxyribonucleotides
(e.g., 0.1 mM Rhodamine 110 UTP (Perkin Elmer Cetus) or 0.1 mM Cy3 dUTP
(Amersham)) with reverse transcriptase (e.g., SuperScript™ II, LTI Inc.) at 42° C for 60
min.

5.7.1.4. HYBRIDIZATION TO MICROARRAYS 
Nucleic acid hybridization and wash conditions are optimally chosen so that the
probe "specifically binds" or "specifically hybridizes" to a specific array site, i.e., the probe
hybridizes, duplexes or binds to a sequence array site with a complementary nucleic acid
sequence but does not hybridize to a site with a non-complementary nucleic acid sequence.
As used herein, one polynucleotide sequence is considered complementary to another when,
if the shorter of the polynucleotides is less than or equal to 25 bases, there are no
mismatches using standard base-pairing rules or, if the shorter of the polynucleotides is
longer than 25 bases, there is no more than a 5% mismatch. Preferably, the polynucleotides
are perfectly complementary (no mismatches). It can easily be demonstrated that specific
hybridization conditions result in specific hybridization by carrying out a hybridization assay
including negative controls (see, e.g., Shalon et al., supra, and Chee et al., supra).
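
The definition of complementarity given above lends itself to a short illustrative check in
Python. The sketch below is an assumption-laden simplification: it handles DNA bases only
(A, C, G, T), slides the shorter sequence along the longer one without gaps, and takes the
best-scoring position. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence under standard base-pairing rules."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def count_mismatches(shorter, longer):
    """Fewest mismatches over all ungapped placements of shorter against longer."""
    probe = reverse_complement(shorter)
    n = len(probe)
    return min(
        sum(1 for x, y in zip(probe, longer[start:start + n]) if x != y)
        for start in range(len(longer) - n + 1)
    )


def is_complementary(seq_a, seq_b):
    """Apply the 25-base / 5% rule to the shorter of the two sequences."""
    shorter, longer = sorted((seq_a.upper(), seq_b.upper()), key=len)
    mismatches = count_mismatches(shorter, longer)
    if len(shorter) <= 25:
        return mismatches == 0            # short sequences: no mismatches allowed
    return mismatches <= 0.05 * len(shorter)  # longer sequences: at most 5% mismatch
```

For a 40-base probe, for example, the 5% rule tolerates up to two mismatched positions.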

Optimal hybridization conditions will depend on the length (e.g., oligomer versus
polynucleotide greater than 200 bases) and type (e.g., RNA, DNA, PNA) of labeled probe
and immobilized polynucleotide or oligonucleotide. General parameters for specific (i.e.,
stringent) hybridization conditions for nucleic acids are described in Sambrook et al., supra,
and in Ausubel et al., 1987, Current Protocols in Molecular Biology, Greene Publishing and
Wiley-Interscience, New York. When the cDNA microarrays of Schena et al. are used,
typical hybridization conditions are hybridization in 5 X SSC plus 0.2% SDS at 65° C for 4
hours followed by washes at 25° C in low stringency wash buffer (1 X SSC plus 0.2% SDS)
followed by 10 minutes at 25° C in high stringency wash buffer (0.1 X SSC plus 0.2% SDS)
(Schena et al., 1996, Proc. Natl. Acad. Sci. USA, 93:10614). Useful hybridization conditions
are also provided in, e.g., Tijssen, 1993, Hybridization With Nucleic Acid Probes, Elsevier
Science Publishers B.V. and Kricka, 1992, Nonisotopic DNA Probe Techniques, Academic
Press San Diego, CA.

5.7.1.5. SIGNAL DETECTION AND DATA ANALYSIS
When fluorescently labeled probes are used, the fluorescence emissions at each site
of a transcript array can be, preferably, detected by scanning confocal laser microscopy. In
one embodiment, a separate scan, using the appropriate excitation line, is carried out for
each of the two fluorophores used. Alternatively, a laser can be used that allows
simultaneous specimen illumination at wavelengths specific to the two fluorophores and
emissions from the two fluorophores can be analyzed simultaneously (see Shalon et al.,
1996, A DNA microarray system for analyzing complex DNA samples using two-color
fluorescent probe hybridization, Genome Research 6:639-645, which is incorporated by
reference in its entirety for all purposes). In a preferred embodiment, the arrays are scanned
with a laser fluorescent scanner with a computer controlled X-Y stage and a microscope
objective. Sequential excitation of the two fluorophores is achieved with a multi-line, mixed
gas laser and the emitted light is split by wavelength and detected with two photomultiplier
tubes. Fluorescence laser scanning devices are described in Schena et al., 1996, Genome
Res. 6:639-645 and in other references cited herein. Alternatively, the fiber-optic bundle
described by Ferguson et al., 1996, Nature Biotech. 14:1681-1684, may be used to monitor
mRNA abundance levels at a large number of sites simultaneously.

Signals are recorded and, in a preferred embodiment, analyzed by computer, e.g.,
using a 12 bit analog to digital board. In one embodiment the scanned image is despeckled
using a graphics program (e.g., Hijaak Graphics Suite) and then analyzed using an image
gridding program that creates a spreadsheet of the average hybridization at each wavelength
at each site. If necessary, an experimentally determined correction for "cross talk" (or
overlap) between the channels for the two fluors may be made. For any particular
hybridization site on the transcript array, a ratio of the emission of the two fluorophores can
be calculated. The ratio is independent of the absolute expression level of the cognate gene,
but is useful for genes whose expression is significantly modulated by drug administration,
gene deletion, or any other tested event.

According to the method of the invention, the relative abundance of an mRNA in two
biological samples is scored as a perturbation and its magnitude determined (i.e., the
abundance is different in the two sources of mRNA tested), or as not perturbed (i.e., the
relative abundance is the same). In various embodiments, a difference between the two
sources of RNA of at least a factor of about 25% (RNA from one source is 25% more
abundant than in the other source), more usually about 50%, even more often by
a factor of about 2 (twice as abundant), 3 (three times as abundant) or 5 (five times as
abundant) is scored as a perturbation.
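
As an illustrative sketch only, the scoring rule described above can be expressed in Python
as a fold-difference threshold. The function names, the returned dictionary layout and the
default threshold of 1.25 (i.e., about 25%) are assumptions chosen for this example; any of the
other thresholds recited above (1.5, 2, 3 or 5) may be passed instead.

```python
def fold_difference(abundance_a, abundance_b):
    """How many times more abundant the higher sample is (always >= 1)."""
    if abundance_a <= 0 or abundance_b <= 0:
        raise ValueError("abundances must be positive")
    high = max(abundance_a, abundance_b)
    low = min(abundance_a, abundance_b)
    return high / low


def score_perturbation(abundance_a, abundance_b, min_fold=1.25):
    """Score one mRNA as perturbed or not, and report the magnitude."""
    fold = fold_difference(abundance_a, abundance_b)
    if abundance_a > abundance_b:
        higher = "a"
    elif abundance_b > abundance_a:
        higher = "b"
    else:
        higher = None
    return {"perturbed": fold >= min_fold, "fold": fold, "higher_in": higher}
```

For instance, abundances of 200 and 100 give a fold difference of 2 and are scored as a
perturbation under a two-fold threshold, whereas 110 and 100 are not under the 25% threshold.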

Preferably, in addition to identifying a perturbation as positive or negative, it is
advantageous to determine the magnitude of the perturbation. This can be carried out, as
noted above, by calculating the ratio of the emission of the two fluorophores used for
differential labeling, or by analogous methods that will be readily apparent to those of skill
in the art.

5.7.2. PATHWAY RESPONSE AND GENESETS 
In one embodiment of the present invention, genesets are determined by observing
the gene expression response to perturbation of a particular pathway. In one embodiment of
the invention, transcript arrays reflecting the transcriptional state of a biological sample of
interest are made by hybridizing a mixture of two differently labeled probes, each
corresponding (i.e., complementary) to the mRNA of a different sample of interest, to the
microarray. According to the present invention, the two samples are of the same type, i.e.,
of the same species and strain, but may differ genetically at a small number (e.g., one, two,
three, or five, preferably one) of loci. Alternatively, they are isogeneic and differ in their
environmental history (e.g., exposed to a drug versus not exposed). Genes whose
expression is highly correlated may belong to a geneset.

In one aspect of the invention, gene expression change in response to a large number
of perturbations is used to construct a clustering tree for the purpose of defining genesets.
Preferably, the perturbations should target different pathways. In order to measure
expression responses to the pathway perturbation, biological samples are subjected to graded
perturbations to pathways of interest. The samples exposed to the perturbation and samples
not exposed to the perturbation are used to construct transcript arrays, which are measured to
find the mRNAs with modified expression and the degree of modification due to exposure to
the perturbation. Thereby, the perturbation-response relationship is obtained.
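
By way of a hedged illustration, the idea that genes with highly correlated expression may
belong to a geneset can be sketched in Python. The sketch below is not the clustering tree
construction itself: it assumes Pearson correlation as the similarity measure and simply links
any two genes whose correlation meets a threshold, which corresponds to cutting a
single-linkage tree at that level. The names and the 0.9 threshold are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long response profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no co-regulation information
    return sxy / math.sqrt(sxx * syy)


def group_coregulated(profiles, min_corr=0.9):
    """Group genes (dict: gene -> responses across perturbations) by correlation."""
    genes = sorted(profiles)
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the group representative, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= min_corr:
                parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for g in genes:
        groups.setdefault(find(g), []).append(g)
    return sorted(groups.values())
```

Each profile here is the response of one gene across the same ordered set of perturbations, so
the more perturbations that target different pathways, the better the groups are resolved.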

The density of levels of the graded drug exposure and graded perturbation control
parameter is governed by the sharpness and structure in the individual gene responses - the
steeper the steepest part of the response, the denser the levels needed to properly resolve the
response.

Further, to reduce experimental error, it is preferable to reverse the
fluorescent labels in two-color differential hybridization experiments, which reduces biases
peculiar to individual genes or array spot locations. In other words, it is preferable to first
measure gene expression with one labeling (e.g., labeling perturbed cells with a first
fluorochrome and unperturbed cells with a second fluorochrome) of the mRNA from the two
cells being measured, and then to measure gene expression from the two cells with reversed
labeling (e.g., labeling perturbed cells with the second fluorochrome and unperturbed cells
with the first fluorochrome). Multiple measurements over exposure levels and perturbation
control parameter levels provide additional experimental error control. With adequate
sampling a trade-off may be made when choosing the width of the spline function S used to
interpolate response data between averaging of errors and loss of structure in the response
functions.
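
One common way to combine a forward and a reversed-label measurement, offered here only as
a sketch, is to work with log ratios. It assumes that each measured ratio is the signal of the
first fluorochrome divided by that of the second, and that any label-specific bias multiplies
both measurements equally; under those assumptions the bias cancels in the difference of the
two log ratios. The function name and return layout are hypothetical.

```python
import math


def combine_dye_swap(forward_ratio, reversed_ratio):
    """Combine first/second fluorochrome ratios from a reversed-label pair.

    forward_ratio:  perturbed cells carry the first fluorochrome.
    reversed_ratio: perturbed cells carry the second fluorochrome.
    """
    f = math.log2(forward_ratio)
    r = math.log2(reversed_ratio)
    return {
        "log2_response": (f - r) / 2,  # perturbed versus unperturbed, bias removed
        "log2_dye_bias": (f + r) / 2,  # bias peculiar to this gene or spot
    }
```

For example, a true four-fold induction measured with a two-fold bias toward the first
fluorochrome would appear as a ratio of 8 in the forward experiment and 0.5 in the reversed
experiment; the combination recovers a log2 response of 2 and a log2 bias of 1.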

5.7.3. MEASUREMENT OF GRADED PERTURBATION RESPONSE DATA
To measure graded response data, the cells are exposed to graded levels of the drug,
drug candidate of interest or graded strength of other perturbation. When the cells are grown
in vitro, the compound is usually added to their nutrient medium. In the case of yeast, it is
preferable to harvest the yeast in early log phase, since expression patterns are relatively
insensitive to time of harvest at that time. Several levels of the drug or other compounds are
added. The particular level employed depends on the particular characteristics of the drug,
but usually will be between about 1 ng/ml and 100 mg/ml. In some cases a drug will be
solubilized in a solvent such as DMSO.

The cells exposed to the drug and cells not exposed to the drug are used to construct
transcript arrays, which are measured to find the mRNAs with altered expression due to
exposure to the drug. Thereby, the drug response is obtained.

As with measurements of pathway responses, it is also preferable for drug
responses, in the case of two-color differential hybridization, to measure with reversed
labeling as well. Also, it is preferable that the levels of drug exposure used provide sufficient
resolution (e.g., by using approximately 10 levels of drug exposure) of rapidly changing
regions of the drug response.
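
A starting grid of exposure levels can be sketched as follows. Logarithmic spacing is an
assumption made for this example only, chosen because the usual range spans many orders of
magnitude; the function name is hypothetical, and additional levels would be placed wherever
the response is found to change rapidly.

```python
def graded_levels(lowest, highest, count=10):
    """Return `count` logarithmically spaced levels from lowest to highest."""
    if lowest <= 0 or highest <= lowest or count < 2:
        raise ValueError("need 0 < lowest < highest and at least two levels")
    step = (highest / lowest) ** (1 / (count - 1))  # constant ratio between levels
    return [lowest * step ** i for i in range(count)]


# Example: ten levels between 1 ng/ml and 100 mg/ml, expressed in ng/ml.
levels_ng_per_ml = graded_levels(1.0, 1e8, count=10)
```

With ten levels across this range, adjacent levels differ by a factor of a little under 8.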

5.7.4. OTHER METHODS OF TRANSCRIPTIONAL STATE MEASUREMENT
The transcriptional state of a cell may be measured by other gene expression
technologies known in the art. Several such technologies produce pools of restriction
fragments of limited complexity for electrophoretic analysis, such as methods combining
double restriction enzyme digestion with phasing primers (see European Patent 0
534858 A1, filed September 24, 1992, by Zabeau et al.), or methods selecting restriction
fragments with sites closest to a defined mRNA end (see, e.g., Prashar et al., 1996, Proc.
Natl. Acad. Sci. USA 93:659-663). Other methods statistically sample cDNA pools, such as
by sequencing sufficient bases (e.g., 20-50 bases) in each of multiple cDNAs to identify
each cDNA, or by sequencing short tags (e.g., 9-10 bases) which are generated at known
positions relative to a defined mRNA end (see, e.g., Velculescu, 1995, Science 270:484-
487).
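
For the tag-based sampling methods mentioned above, transcript abundance is estimated from
how often each short tag is observed. The following Python sketch assumes the tags have already
been extracted and assigned to transcripts; the function name is hypothetical and the sketch
ignores sequencing errors and tags shared by more than one transcript.

```python
from collections import Counter


def tag_abundance(tags):
    """Relative abundance of each short tag among all sampled tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}
```

Because the pool is sampled statistically, the precision of each estimate improves as more
tags are sequenced.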

5.7.5. MEASUREMENT OF OTHER ASPECTS OF BIOLOGICAL STATE 
In various embodiments of the present invention, aspects of the biological state other
than the transcriptional state, such as the translational state, the activity state, or mixed
aspects can be measured in order to obtain drug and pathway responses. Details of these
embodiments are described in this section.

5.7.5.1. EMBODIMENTS BASED ON TRANSLATIONAL STATE MEASUREMENTS 
Measurement of the translational state may be performed according to several
methods. For example, whole genome monitoring of protein (i.e., the "proteome," Goffeau
et al., supra) can be carried out by constructing a microarray in which binding sites comprise
immobilized, preferably monoclonal, antibodies specific to a plurality of protein species
encoded by the cell genome. Preferably, antibodies are present for a substantial fraction of
the encoded proteins, or at least for those proteins relevant to the action of a drug of interest.
Methods for making monoclonal antibodies are well known (see, e.g., Harlow and Lane,
1988, Antibodies: A Laboratory Manual, Cold Spring Harbor, New York, which is
incorporated in its entirety for all purposes). In a preferred embodiment, monoclonal
antibodies are raised against synthetic peptide fragments designed based on genomic
sequence of the cell. With such an antibody array, proteins from the cell are contacted to the
array and their binding is assayed with assays known in the art.

Alternatively, proteins can be separated by two-dimensional gel electrophoresis
systems. Two-dimensional gel electrophoresis is well-known in the art and typically
involves iso-electric focusing along a first dimension followed by SDS-PAGE
electrophoresis along a second dimension. See, e.g., Hames et al., 1990, Gel Electrophoresis
of Proteins: A Practical Approach, IRL Press, New York; Shevchenko et al., 1996, Proc.
Nat'l Acad. Sci. USA 93:1440-1445; Sagliocco et al., 1996, Yeast 12:1519-1533; Lander,
1996, Science 274:536-539. The resulting electropherograms can be analyzed by numerous
techniques, including mass spectrometric techniques, western blotting and immunoblot
analysis using polyclonal and monoclonal antibodies, and internal and N-terminal
micro-sequencing. Using these techniques, it is possible to identify a substantial fraction of all the
proteins produced under given physiological conditions, including in cells (e.g., in yeast)
exposed to a drug, or in cells modified by, e.g., deletion or over-expression of a specific
gene.

5.7.5.2. EMBODIMENTS BASED ON OTHER ASPECTS OF THE BIOLOGICAL STATE
Even though methods of this invention are illustrated by embodiments involving
gene expression profiles, the methods of the invention are applicable to any cellular
constituent that can be monitored.

In particular, where activities of proteins relevant to the characterization of a
perturbation, such as drug action, can be measured, embodiments of this invention can be
based on such measurements. Activity measurements can be performed by any functional,
biochemical, or physical means appropriate to the particular activity being characterized.
Where the activity involves a chemical transformation, the cellular protein can be contacted
with the natural substrate(s), and the rate of transformation measured. Where the activity
involves association in multimeric units, for example association of an activated DNA
binding complex with DNA, the amount of associated protein or secondary consequences of
the association, such as amounts of mRNA transcribed, can be measured. Also, where only
a functional activity is known, for example, as in cell cycle control, performance of the
function can be observed. However known and measured, the changes in protein activities
form the response data analyzed by the foregoing methods of this invention.

In alternative and non-limiting embodiments, response data may be formed of mixed
aspects of the biological state of a cell. Response data can be constructed from, e.g.,
changes in certain mRNA abundances, changes in certain protein abundances, and changes
in certain protein activities.

5.8. METHOD FOR PROBING CELLULAR STATES
One aspect of the invention provides methods for the analysis of co-varying cellular
constituents. The methods of this invention are also useful for the analysis of responses of a
biological sample to perturbations designed to probe cellular state. This section provides
some illustrative methods for probing cellular states.

Methods for targeted perturbation of cellular states at various levels of a cell are
increasingly widely known and applied in the art. Any such methods that are capable of
specifically targeting and controllably modifying (e.g., either by a graded increase or
activation or by a graded decrease or inhibition) specific cellular constituents (e.g., gene
expression, RNA concentrations, protein abundances, protein activities, or so forth) can be
employed in performing cellular state perturbations. Controllable modifications of cellular
constituents consequentially controllably perturb cellular states originating at the modified
cellular constituents. Preferable modification methods are capable of individually targeting
each of a plurality of cellular constituents and most preferably a substantial fraction of such
cellular constituents.

The following methods are exemplary of those that can be used to modify cellular
constituents and thereby to produce cellular state perturbations which generate the cellular
state responses used in the steps of the methods of this invention as previously described.
This invention is adaptable to other methods for making controllable perturbations to
cellular states, and especially to cellular constituents from which cellular states originate.

Cellular state perturbations are preferably made in cells of cell types derived from
any organism for which genomic or expressed sequence information is available and for
which methods are available that permit controllable modification of the expression of
specific genes. Genome sequencing is currently underway for several eukaryotic organisms,
including humans, nematodes, Arabidopsis, and flies. In a preferred embodiment, the
invention is carried out using a yeast, with Saccharomyces cerevisiae most preferred because
the sequence of the entire genome of a S. cerevisiae strain has been determined. In addition,
well-established methods are available for controllably modifying expression of yeast genes.
A preferred strain of yeast is a S. cerevisiae strain for which yeast genomic sequence is
known, such as strain S288C or substantially isogenic derivatives of it (see, e.g., Nature
369, 371-8 (1994); P.N.A.S. 92:3809-13 (1995); E.M.B.O. J. 13:5795-5809 (1994); Science
265:2077-2082 (1994); E.M.B.O. J. 15:2031-49 (1996), all of which are incorporated herein).
However, other strains may be used as well. Yeast strains are available from American Type
Culture Collection, Manassas, Virginia. Standard techniques for manipulating yeast are
described in C. Kaiser, S. Michaelis, & A. Mitchell, 1994, Methods in Yeast Genetics: A
Cold Spring Harbor Laboratory Course Manual, Cold Spring Harbor Laboratory Press, New
York; and Sherman et al., 1986, Methods in Yeast Genetics: A Laboratory Manual, Cold
Spring Harbor Laboratory, Cold Spring Harbor, New York, both of which are incorporated
by reference in their entirety and for all purposes.

The exemplary methods described in the following include use of titratable
expression systems, use of transfection or viral transduction systems, direct modifications to
RNA abundances or activities, direct modifications of protein abundances, and direct
modification of protein activities including use of drugs (or chemical moieties in general)
with specific known action.

5.8.1. TITRATABLE EXPRESSION SYSTEMS

Any of the several known titratable, or equivalently controllable, expression systems
available for use in the budding yeast Saccharomyces cerevisiae are adaptable to this
invention (Mumberg et al., 1994, Regulatable promoters of Saccharomyces cerevisiae:
comparison of transcriptional activity and their use for heterologous expression, Nucl. Acids
Res. 22:5767-5768). Usually, gene expression is controlled by transcriptional controls, with
the promoter of the gene to be controlled replaced on its chromosome by a controllable,
exogenous promoter. The most commonly used controllable promoter in yeast is the GAL1
promoter (Johnston et al., 1984, Sequences that regulate the divergent GAL1-GAL10
promoter in Saccharomyces cerevisiae, Mol. Cell. Biol. 8:1440-1448). The GAL1 promoter
is strongly repressed by the presence of glucose in the growth medium, and is gradually
switched on in a graded manner to high levels of expression by the decreasing abundance of
glucose and the presence of galactose. The GAL1 promoter usually allows a 5-100 fold
range of expression control on a gene of interest.

Other frequently used promoter systems include the MET25 promoter (Kerjan et al.,
1986, Nucleotide sequence of the Saccharomyces cerevisiae MET25 gene, Nucl. Acids Res.
14:7861-7871), which is induced by the absence of methionine in the growth medium, and
the CUP1 promoter, which is induced by copper (Mascorro-Gallardo et al., 1996,
Construction of a CUP1 promoter-based vector to modulate gene expression in
Saccharomyces cerevisiae, Gene 172:169-170). All of these promoter systems are
controllable in that gene expression can be incrementally controlled by incremental changes
in the abundances of a controlling moiety in the growth medium.

One disadvantage of the above listed expression systems is that control of promoter
activity (effected by, e.g., changes in carbon source, removal of certain amino acids), often
causes other changes in cellular physiology which independently alter the expression levels
of other genes. A recently developed system for yeast, the Tet system, alleviates this
problem to a large extent (Gari et al., 1997, A set of vectors with a tetracycline-regulatable
promoter system for modulated gene expression in Saccharomyces cerevisiae, Yeast
13:837-848). The Tet promoter, adopted from mammalian expression systems (Gossen et al.,
1995, Transcriptional activation by tetracyclines in mammalian cells, Proc. Nat. Acad. Sci.
USA 89:5547-5551) is modulated by the concentration of the antibiotic tetracycline or the
structurally related compound doxycycline. Thus, in the absence of doxycycline, the
promoter induces a high level of expression, and the addition of increasing levels of
doxycycline causes increased repression of promoter activity. Intermediate levels of gene
expression can be achieved in the steady state by addition of intermediate levels of drug.
Furthermore, levels of doxycycline that give maximal repression of promoter activity (10
micrograms/ml) have no significant effect on the growth rate of wild type yeast cells (Gari
et al., 1997, A set of vectors with a tetracycline-regulatable promoter system for modulated
gene expression in Saccharomyces cerevisiae, Yeast 13:837-848).
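
The graded repression described above can be pictured with a simple dose-response curve. The sketch below assumes a Hill-type repression function, which is a common modelling choice and is not taken from the cited references. The half-repression concentration `k_half` and the Hill coefficient `hill_n` are hypothetical placeholders that would have to be fitted to an actual titration.

```python
def expression_fraction(dox, k_half=0.1, hill_n=2.0):
    """Fraction of maximal promoter activity at doxycycline level `dox`.

    Assumed Hill-type repression: 1 / (1 + (dox / k_half) ** hill_n).
    k_half (same units as dox) and hill_n are hypothetical placeholders.
    """
    if dox < 0:
        raise ValueError("concentration must be non-negative")
    return 1.0 / (1.0 + (dox / k_half) ** hill_n)


def dox_for_fraction(fraction, k_half=0.1, hill_n=2.0):
    """Invert the curve: doxycycline level giving a target expression fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return k_half * (1.0 / fraction - 1.0) ** (1.0 / hill_n)
```

Under these assumptions, `dox_for_fraction` suggests which intermediate drug level to add in order to hold the controlled gene at a chosen steady-state fraction of its unrepressed expression.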

In mammalian cells, several means of titrating expression of genes are available
(Spencer, 1996, Creating conditional mutations in mammals, Trends Genet. 12:181-187).
As mentioned above, the Tet system is widely used, both in its original form, the "forward"
system, in which addition of doxycycline represses transcription, and in the newer "reverse"
system, in which doxycycline addition stimulates transcription (Gossen et al., 1995, Proc.
Natl. Acad. Sci. USA 89:5547-5551; Hoffmann et al., 1997, Nucl. Acids Res. 25:1078-1079;
Hofmann et al., 1996, Proc. Natl. Acad. Sci. USA 83:5185-5190; Paulus et al., 1996,
Journal of Virology 70:62-67). Another commonly used controllable promoter system in
mammalian cells is the ecdysone-inducible system developed by Evans and colleagues (No
et al., 1996, Ecdysone-inducible gene expression in mammalian cells and transgenic mice,
Proc. Nat. Acad. Sci. USA 93:3346-3351), where expression is controlled by the level of
muristerone added to the cultured cells. Finally, expression can be modulated using the
"chemical-induced dimerization" (CID) system developed by Schreiber, Crabtree, and
colleagues (Belshaw et al., 1996, Controlling protein association and subcellular localization
with a synthetic ligand that induces heterodimerization of proteins, Proc. Nat. Acad. Sci.
USA 93:4604-4607; Spencer, 1996, Creating conditional mutations in mammals, Trends
Genet. 12:181-187) and similar systems in yeast. In this system, the gene of interest is put
under the control of the CID-responsive promoter, and transfected into cells expressing two
different hybrid proteins, one comprised of a DNA-binding domain fused to FKBP12, which
binds FK506. The other hybrid protein contains a transcriptional activation domain also
fused to FKBP12. The CID inducing molecule is FK1012, a homodimeric version of FK506
that is able to bind simultaneously both the DNA binding and transcriptional activating
hybrid proteins. In the graded presence of FK1012, graded transcription of the controlled
gene is activated.

For each of the mammalian expression systems described above, as is widely known
to those of skill in the art, the gene of interest is put under the control of the controllable
promoter, and a plasmid harboring this construct along with an antibiotic resistance gene is
transfected into cultured mammalian cells. In general, the plasmid DNA integrates into the
genome, and drug resistant colonies are selected and screened for appropriate expression of
the regulated gene. Alternatively, the regulated gene can be inserted into an episomal
plasmid such as pCEP4 (Invitrogen, Inc.), which contains components of the Epstein-Barr
virus necessary for plasmid replication.

In a preferred embodiment, titratable expression systems, such as the ones described
above, are introduced for use into cells or organisms lacking the corresponding endogenous
gene and/or gene activity, e.g., organisms in which the endogenous gene has been disrupted
or deleted. Methods for producing such "knock outs" are well known to those of skill in the
art, see, e.g., Pettitt et al., 1996, Development 122:4149-4157; Spradling et al., 1995, Proc.
Natl. Acad. Sci. USA 92:10824-10830; Ramirez-Solis et al., 1993, Methods Enzymol.
225:855-878; and Thomas et al., 1987, Cell 51:503-512.

5.8.2. TRANSFECTION SYSTEMS FOR MAMMALIAN CELLS
Transfection or viral transduction of target genes can introduce controllable
perturbations in biological cellular states in mammalian cells. Preferably, transfection or
transduction of a target gene can be used with cells that do not naturally express the target
gene of interest. Such non-expressing cells can be derived from a tissue not normally
expressing the target gene or the target gene can be specifically mutated in the cell. The
target gene of interest can be cloned into one of many mammalian expression plasmids, for
example, the pcDNA3.1 +/- system (Invitrogen, Inc.) or retroviral vectors, and introduced
into the non-expressing host cells. Transfected or transduced cells expressing the target gene
may be isolated by selection for a drug resistance marker encoded by the expression vector.
The level of gene transcription is monotonically related to the transfection dosage. In this
way, the effects of varying levels of the target gene may be investigated.

A particular example of the use of this method is the search for drugs that target the
src-family protein tyrosine kinase, lck, a key component of the T cell receptor activation
cellular state (Anderson et al., 1994, Involvement of the protein tyrosine kinase p56 (lck) in
T cell signaling and thymocyte development, Adv. Immunol. 56:171-178). Inhibitors of this
enzyme are of interest as potential immunosuppressive drugs (Hanke, 1996, Discovery of a
Novel, Potent, and src family-selective tyrosine kinase inhibitor, J. Biol. Chem. 271:695-701).
A specific mutant of the Jurkat T cell line (JCaM1) is available that does not express lck
kinase (Straus et al., 1992, Genetic evidence for the involvement of the lck tyrosine kinase
in signal transduction through the T cell antigen receptor, Cell 70:585-593). Therefore,
introduction of the lck gene into JCaM1 by transfection or transduction permits specific
perturbation of cellular states of T cell activation regulated by the lck kinase. The efficiency
of transfection or transduction, and thus the level of perturbation, is dose related. The
method is generally useful for providing perturbations of gene expression or protein
abundances in cells not normally expressing the genes to be perturbed.

5.8.3. METHODS OF MODIFYING RNA ABUNDANCES OR ACTIVITIES
Methods of modifying RNA abundances and activities currently fall within three
classes: ribozymes, antisense species, and RNA aptamers (Good et al., 1997, Gene Therapy
4: 45-54). Controllable application or exposure of a cell to these entities permits
controllable perturbation of RNA abundances.

Ribozymes are RNAs which are capable of catalyzing RNA cleavage reactions
(Cech, 1987, Science 236:1532-1539; PCT International Publication WO 90/11364,
published October 4, 1990; Sarver et al., 1990, Science 247: 1222-1225). "Hairpin" and
"hammerhead" RNA ribozymes can be designed to specifically cleave a particular target
mRNA. Rules have been established for the design of short RNA molecules with ribozyme
activity, which are capable of cleaving other RNA molecules in a highly sequence specific
way and can be targeted to virtually all kinds of RNA (Haseloff et al., 1988, Nature
334:585-591; Koizumi et al., 1988, FEBS Lett. 228:228-230; Koizumi et al., 1988, FEBS
Lett. 239:285-288). Ribozyme methods involve exposing a cell to, inducing expression in a
cell, etc. of such small RNA ribozyme molecules (Grassi and Marini, 1996, Annals of
Medicine 28: 499-510; Gibson, 1996, Cancer and Metastasis Reviews 15: 287-299).
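
The design rules themselves are not spelled out in this section. As a rough sketch only, the code below assumes the commonly cited rule that a hammerhead ribozyme cleaves its target immediately 3' of a GUC triplet, and lists candidate cleavage positions in a target mRNA. The function name and the default triplet are assumptions made for illustration.

```python
def find_cleavage_sites(mrna, triplet="GUC"):
    """Return candidate cleavage positions in an mRNA sequence.

    Each position is the number of nucleotides lying 5' of the cut, i.e. the
    cut falls immediately 3' of an occurrence of `triplet`. DNA-style input
    (T instead of U) is accepted and converted to RNA.
    """
    mrna = mrna.upper().replace("T", "U")
    sites = []
    start = mrna.find(triplet)
    while start != -1:
        sites.append(start + len(triplet))
        # Resume one base further on so overlapping occurrences are not missed.
        start = mrna.find(triplet, start + 1)
    return sites
```

A real design would additionally choose flanking arms complementary to the sequence on either side of each candidate site, which this sketch does not attempt.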

Ribozymes can be routinely expressed in vivo in sufficient number to be catalytically
effective in cleaving mRNA, and thereby modifying mRNA abundances in a cell (Cotten et
al., 1989, Ribozyme mediated destruction of RNA in vivo, The EMBO J. 8:3861-3866). In
particular, a ribozyme coding DNA sequence, designed according to the previous rules and
synthesized, for example, by standard phosphoramidite chemistry, can be ligated into a
restriction enzyme site in the anticodon stem and loop of a gene encoding a tRNA, which
can then be transformed into and expressed in a cell of interest by methods routine in the art.
Preferably, an inducible promoter (e.g., a glucocorticoid or a tetracycline response element)
is also introduced into this construct so that ribozyme expression can be selectively
controlled. tDNA genes (i.e., genes encoding tRNAs) are useful in this application because
of their small size, high rate of transcription, and ubiquitous expression in different kinds of
tissues. Therefore, ribozymes can be routinely designed to cleave virtually any mRNA
sequence, and a cell can be routinely transformed with DNA coding for such ribozyme
sequences such that a controllable and catalytically effective amount of the ribozyme is
expressed. Accordingly the abundance of virtually any RNA species in a cell can be
perturbed.

In another embodiment, activity of a target RNA (preferably mRNA) species,
specifically its rate of translation, can be controllably inhibited by the controllable
application of antisense nucleic acids. An "antisense" nucleic acid as used herein refers to a
nucleic acid capable of hybridizing to a sequence-specific (e.g., non-poly A) portion of the
target RNA, for example its translation initiation region, by virtue of some sequence
complementarity to a coding and/or non-coding region. The antisense nucleic acids of the
invention can be oligonucleotides that are double-stranded or single-stranded, RNA or DNA
or a modification or derivative thereof, which can be directly administered in a controllable
manner to a cell or which can be produced intracellularly by transcription of exogenous,
introduced sequences in controllable quantities sufficient to perturb translation of the target
RNA.
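
To make the notion of sequence complementarity concrete, the following minimal sketch derives a single-stranded DNA antisense oligonucleotide as the reverse complement of a chosen window of the target RNA, for instance around its translation initiation region. The function name, the window parameters, and the example sequence in the test are hypothetical.

```python
# Watson-Crick partner, written as a DNA base, for each RNA or DNA base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}


def antisense_dna(target_rna, start, length):
    """Return the DNA reverse complement of target_rna[start:start + length].

    The result is written 5' to 3', so it can hybridize antiparallel to the
    selected region of the target.
    """
    if start < 0 or length <= 0 or start + length > len(target_rna):
        raise ValueError("window lies outside the target sequence")
    region = target_rna.upper()[start:start + length]
    return "".join(COMPLEMENT[base] for base in reversed(region))
```

Choosing the window, and therefore the oligonucleotide length, is left to the caller.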

Preferably, antisense nucleic acids are of at least six nucleotides and are preferably
oligonucleotides (ranging from 6 to about 200 nucleotides). In specific aspects, the
oligonucleotide is at least 10 nucleotides, at least 15 nucleotides, at least 100 nucleotides, or
at least 200 nucleotides. The oligonucleotides can be DNA or RNA or chimeric mixtures or
derivatives or modified versions thereof, single-stranded or double-stranded. The
oligonucleotide can be modified at the base moiety, sugar moiety, or phosphate backbone.
The oligonucleotide may include other appending groups such as peptides, or agents
facilitating transport across the cell membrane (see, e.g., Letsinger et al., 1989, Proc. Natl.
Acad. Sci. U.S.A. 86: 6553-6556; Lemaitre et al., 1987, Proc. Natl. Acad. Sci. 84: 648-652;
PCT Publication No. WO 88/09810, published December 15, 1988), hybridization-triggered
cleavage agents (see, e.g., Krol et al., 1988, BioTechniques 6: 958-976) or intercalating
agents (see, e.g., Zon, 1988, Pharm. Res. 5: 539-549).

In a preferred aspect of the invention, an antisense oligonucleotide is provided,
preferably of single-stranded DNA. The oligonucleotide may be modified at any position on
its structure with constituents generally known in the art.

The antisense oligonucleotides may comprise at least one modified base moiety
which is selected from the group including but not limited to 5-fluorouracil, 5-bromouracil,
5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine,
5-(carboxyhydroxylmethyl) uracil, 5-carboxymethylaminomethyl-2-thiouridine,
5-carboxymethylaminomethyluracil, dihydrouracil, beta-D-galactosylqueosine, inosine,
N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine,
2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-adenine,
7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil,
beta-D-mannosylqueosine, 5'-methoxycarboxymethyluracil, 5-methoxyuracil,
2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid (v), wybutoxosine,
pseudouracil, queosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil,
5-methyluracil, uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid (v),
5-methyl-2-thiouracil, 3-(3-amino-3-N-2-carboxypropyl) uracil, (acp3)w, and
2,6-diaminopurine.

In another embodiment, the oligonucleotide comprises at least one modified sugar
moiety selected from the group including, but not limited to, arabinose, 2-fluoroarabinose,
xylulose, and hexose.

In yet another embodiment, the oligonucleotide comprises at least one modified
phosphate backbone selected from the group consisting of a phosphorothioate, a
phosphorodithioate, a phosphoramidothioate, a phosphoramidate, a phosphordiamidate, a
methylphosphonate, an alkyl phosphotriester, and a formacetal or analog thereof.

In yet another embodiment, the oligonucleotide is a 2-α-anomeric oligonucleotide.
An α-anomeric oligonucleotide forms specific double-stranded hybrids with complementary
RNA in which, contrary to the usual β-units, the strands run parallel to each other (Gautier et
al., 1987, Nucl. Acids Res. 15: 6625-6641).

The oligonucleotide may be conjugated to another molecule, e.g., a peptide,
hybridization triggered cross-linking agent, transport agent, hybridization-triggered cleavage
agent, etc.

The antisense nucleic acids of the invention comprise a sequence complementary to
at least a portion of a target RNA species. However, absolute complementarity, although
preferred, is not required. A sequence "complementary to at least a portion of an RNA," as
referred to herein, means a sequence having sufficient complementarity to be able to
hybridize with the RNA, forming a stable duplex; in the case of double-stranded antisense
nucleic acids, a single strand of the duplex DNA may thus be tested, or triplex formation
may be assayed. The ability to hybridize will depend on both the degree of complementarity
and the length of the antisense nucleic acid. Generally, the longer the hybridizing nucleic
acid, the more base mismatches with a target RNA it may contain and still form a stable
duplex (or triplex, as the case may be). One skilled in the art can ascertain a tolerable degree
of mismatch by use of standard procedures to determine the melting point of the hybridized
complex. The amount of antisense nucleic acid that will be effective in inhibiting
translation of the target RNA can be determined by standard assay techniques.
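
The melting point itself is determined experimentally. Purely for orientation, the sketch below counts mismatches between an antisense strand and its target and applies the Wallace rule (2 °C per A/T and 4 °C per G/C), a crude estimate that is usually quoted only for short oligonucleotides. Both functions are illustrative stand-ins introduced here, not the standard procedures referred to in the text.

```python
# Base pairs accepted as matches: (antisense base, target base).
WATSON_CRICK = {("A", "U"), ("A", "T"), ("T", "A"),
                ("U", "A"), ("G", "C"), ("C", "G")}


def count_mismatches(antisense, target):
    """Count non-Watson-Crick positions in an antiparallel alignment.

    Both strands are written 5' to 3' and must have equal length; the
    antisense strand is compared against the target read in reverse.
    """
    if len(antisense) != len(target):
        raise ValueError("strands must have equal length")
    pairs = zip(antisense.upper(), reversed(target.upper()))
    return sum(pair not in WATSON_CRICK for pair in pairs)


def wallace_tm(oligo):
    """Rough melting temperature in degrees C by the Wallace rule."""
    oligo = oligo.upper()
    at = sum(oligo.count(base) for base in "ATU")
    gc = sum(oligo.count(base) for base in "GC")
    return 2 * at + 4 * gc
```

Because the estimate grows with length, a longer oligonucleotide starts from a higher predicted melting temperature, which is consistent with the statement above that longer hybridizing sequences tolerate more mismatches.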

Oligonucleotides of the invention may be synthesized by standard methods known in
the art, e.g., by use of an automated DNA synthesizer (such as are commercially available
from Biosearch, Applied Biosystems, etc.). As examples, phosphorothioate oligonucleotides
may be synthesized by the method of Stein et al. (1988, Nucl. Acids Res. 16: 3209),
methylphosphonate oligonucleotides can be prepared by use of controlled pore glass
polymer supports (Sarin et al., 1988, Proc. Natl. Acad. Sci. U.S.A. 85: 7448-7451), etc. In
another embodiment, the oligonucleotide is a 2'-O-methylribonucleotide (Inoue et al., 1987,
Nucl. Acids Res. 15: 6131-6148), or a chimeric RNA-DNA analog (Inoue et al., 1987,
FEBS Lett. 215: 327-330).

The synthesized antisense oligonucleotides can then be administered to a cell in a
controlled manner. For example, the antisense oligonucleotides can be placed in the growth
environment of the cell at controlled levels where they may be taken up by the cell. The
uptake of the antisense oligonucleotides can be assisted by use of methods well known in the
art.

In an alternative embodiment, the antisense nucleic acids of the invention are
controllably expressed intracellularly by transcription from an exogenous sequence. For
example, a vector can be introduced in vivo such that it is taken up by a cell, within which
cell the vector or a portion thereof is transcribed, producing an antisense nucleic acid (RNA)
of the invention. Such a vector would contain a sequence encoding the antisense nucleic
acid. Such a vector can remain episomal or become chromosomally integrated, as long as it
can be transcribed to produce the desired antisense RNA. Such vectors can be constructed
by recombinant DNA technology methods standard in the art. Vectors can be plasmid, viral,
or others known in the art, used for replication and expression in mammalian cells.
Expression of the sequences encoding the antisense RNAs can be by any promoter known in
the art to act in a cell of interest. Such promoters can be inducible or constitutive. Most
preferably, promoters are controllable or inducible by the administration of an exogenous
moiety in order to achieve controlled expression of the antisense oligonucleotide. Such
controllable promoters include the Tet promoter. Less preferably usable promoters for
mammalian cells include, but are not limited to: the SV40 early promoter region (Bernoist
and Chambon, 1981, Nature 290: 304-310), the promoter contained in the 3' long terminal
repeat of Rous sarcoma virus (Yamamoto et al., 1980, Cell 22: 787-797), the herpes
thymidine kinase promoter (Wagner et al., 1981, Proc. Natl. Acad. Sci. U.S.A. 78:
1441-1445), the regulatory sequences of the metallothionein gene (Brinster et al., 1982,
Nature 296: 39-42), etc.

Therefore, antisense nucleic acids can be routinely designed to target virtually any
mRNA sequence, and a cell can be routinely transformed with or exposed to nucleic acids
coding for such antisense sequences such that an effective and controllable amount of the
antisense nucleic acid is expressed. Accordingly the translation of virtually any RNA
species in a cell can be controllably perturbed.

Finally, in a further embodiment, RNA aptamers can be introduced into or expressed
in a cell. RNA aptamers are specific RNA ligands for proteins, such as for Tat and Rev
RNA (Good et al., 1997, Gene Therapy 4: 45-54) that can specifically inhibit their
translation.

5.8.4. METHODS OF MODIFYING PROTEIN ABUNDANCES
Methods of modifying protein abundances include, inter alia, those altering protein
degradation rates and those using antibodies (which bind to proteins affecting abundances or
activities of native target protein species). Increasing (or decreasing) the degradation rates
of a protein species decreases (or increases) the abundance of that species. Methods for
controllably increasing the degradation rate of a target protein in response to elevated
temperature and/or exposure to a particular drug, which are known in the art, can be
employed in this invention. For example, one such method employs a heat-inducible or
drug-inducible N-terminal degron, which is an N-terminal protein fragment that exposes a
degradation signal promoting rapid protein degradation at a higher temperature (e.g., 37° C)
and which is hidden to prevent rapid degradation at a lower temperature (e.g., 23° C)
(Dohmen et al., 1994, Science 263: 1273-1276). Such an exemplary degron is Arg-DHFRts,
a variant of murine dihydrofolate reductase in which the N-terminal Val is replaced by Arg
and the Pro at position 66 is replaced with Leu. According to this method, for example, a
gene for a target protein, P, is replaced by standard gene targeting methods known in the art
(Lodish et al., 1995, Molecular Biology of the Cell, W.H. Freeman and Co., New York,
especially chap 8) with a gene coding for the fusion protein Ub-Arg-DHFRts-P ("Ub" stands
for ubiquitin). The N-terminal ubiquitin is rapidly cleaved after translation exposing the
N-terminal degron. At lower temperatures, lysines internal to Arg-DHFRts are not exposed,
ubiquitination of the fusion protein does not occur, degradation is slow, and active target
protein levels are high. At higher temperatures (in the absence of methotrexate), lysines
internal to Arg-DHFRts are exposed, ubiquitination of the fusion protein occurs, degradation
is rapid, and active target protein levels are low. Heat activation of degradation is
controllably blocked by exposure to methotrexate. This method is adaptable to other
N-terminal degrons which are responsive to other inducing factors, such as drugs and
temperature changes.
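
The inverse relation between degradation rate and abundance can be illustrated with a minimal kinetic sketch. It assumes constant synthesis at rate `k_syn` and first-order degradation with rate constant `k_deg`, a textbook simplification that is not stated in this section. All parameter names and values are hypothetical.

```python
import math


def steady_state_abundance(k_syn, k_deg):
    """Steady state of dP/dt = k_syn - k_deg * P, which is k_syn / k_deg."""
    if k_deg <= 0:
        raise ValueError("degradation rate constant must be positive")
    return k_syn / k_deg


def abundance_after_shift(p0, k_syn, k_deg, t):
    """Abundance a time t after the degradation rate constant becomes k_deg.

    Starting from abundance p0, the protein level relaxes exponentially
    toward the new steady state k_syn / k_deg.
    """
    p_ss = steady_state_abundance(k_syn, k_deg)
    return p_ss + (p0 - p_ss) * math.exp(-k_deg * t)
```

In this picture a shift to the higher temperature corresponds to a larger `k_deg`, and the protein level decays toward a proportionally lower steady state.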
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Target protein abundances and also, directly or indirectly, their activities can also be 
decreased by (neutralizing) antibodies. By providing for controlled exposure to such
antibodies, protein abundances/activities can be controllably modified. For example,
antibodies to suitable epitopes on protein surfaces may decrease the abundance, and thereby
indirectly decrease the activity, of the wild-type active form of a target protein by
aggregating active forms into complexes with less or minimal activity as compared to the
unaggregated wild-type form. Alternately, antibodies may directly decrease
protein activity by, e.g., interacting directly with active sites or by blocking access of
substrates to active sites. Conversely, in certain cases, (activating) antibodies may also
interact with proteins and their active sites to increase resulting activity. In either case,
antibodies (of the various types to be described) can be raised against specific protein
species (by the methods to be described) and their effects screened. The effects of the
antibodies can be assayed and suitable antibodies selected that raise or lower the target
protein species concentration and/or activity. Such assays involve introducing antibodies
into a cell (see below), and assaying the concentration of the wild-type amount or activities
of the target protein by standard means (such as immunoassays) known in the art. The net
activity of the wild-type form can be assayed by assay means appropriate to the known
activity of the target protein.

Antibodies can be introduced into cells in numerous fashions, including, for
example, microinjection of antibodies into a cell (Morgan et al., 1988, Immunology Today
9:84-86) or transforming hybridoma mRNA encoding a desired antibody into a cell (Burke
et al., 1984, Cell 36:847-858). In a further technique, recombinant antibodies can be
engineered and ectopically expressed in a wide variety of non-lymphoid cell types to bind
to target proteins as well as to block target protein activities (Biocca et al., 1995, Trends in
Cell Biology 5:248-252). Preferably, expression of the antibody is under control of a
controllable promoter, such as the Tet promoter. A first step is the selection of a particular
monoclonal antibody with appropriate specificity to the target protein (see below). Then
sequences encoding the variable regions of the selected antibody can be cloned into various
engineered antibody formats, including, for example, whole antibody, Fab fragments, Fv
fragments, single chain Fv fragments (VH and VL regions united by a peptide linker) ("ScFv"
fragments), diabodies (two associated ScFv fragments with different specificities), and so
forth (Hayden et al., 1997, Current Opinion in Immunology 9:210-212). Intracellularly
expressed antibodies of the various formats can be targeted into cellular compartments (e.g.,
the cytoplasm, the nucleus, the mitochondria, etc.) by expressing them as fusions with the
various known intracellular leader sequences (Bradbury et al., 1995, Antibody Engineering
(vol 2) (Borrebaeck ed.), pp 295-361, IRL Press). In particular, the ScFv format appears to
be particularly suitable for cytoplasmic targeting.

Antibody types include, but are not limited to, polyclonal, monoclonal, chimeric,
single chain, Fab fragments, and an Fab expression library. Various procedures known in
the art may be used for the production of polyclonal antibodies to a target protein. For
production of the antibody, various host animals can be immunized by injection with the
target protein; such host animals include, but are not limited to, rabbits, mice, rats, etc.
Various adjuvants can be used to increase the immunological response, depending on the
host species, and include, but are not limited to, Freund's (complete and incomplete),
mineral gels such as aluminum hydroxide, surface active substances such as lysolecithin,
pluronic polyols, polyanions, peptides, oil emulsions, dinitrophenol, and potentially useful
human adjuvants such as bacillus Calmette-Guerin (BCG) and corynebacterium parvum.

For preparation of monoclonal antibodies directed towards a target protein, any
technique that provides for the production of antibody molecules by continuous cell lines in
culture may be used. Such techniques include, but are not restricted to, the hybridoma
technique originally developed by Kohler and Milstein (1975, Nature 256: 495-497), the
trioma technique, the human B-cell hybridoma technique (Kozbor et al., 1983, Immunology
Today 4: 72), and the EBV hybridoma technique to produce human monoclonal antibodies
(Cole et al., 1985, in Monoclonal Antibodies and Cancer Therapy, Alan R. Liss, Inc., pp.
77-96). In an additional embodiment of the invention, monoclonal antibodies can be
produced in germ-free animals utilizing recent technology (PCT/US90/02545). According
to the invention, human antibodies may be used and can be obtained by using human
hybridomas (Cote et al., 1983, Proc. Natl. Acad. Sci. USA 80: 2026-2030), or by
transforming human B cells with EBV virus in vitro (Cole et al., 1985, in Monoclonal
Antibodies and Cancer Therapy, Alan R. Liss, Inc., pp. 77-96). In fact, according to the
invention, techniques developed for the production of "chimeric antibodies" (Morrison et
al., 1984, Proc. Natl. Acad. Sci. USA 81: 6851-6855; Neuberger et al., 1984, Nature
312:604-608; Takeda et al., 1985, Nature 314: 452-454) by splicing the genes from a mouse
antibody molecule specific for the target protein together with genes from a human antibody
molecule of appropriate biological activity can be used; such antibodies are within the scope
of this invention.

Additionally, where monoclonal antibodies are advantageous, they can be
alternatively selected from large antibody libraries using the techniques of phage display
(Marks et al., 1992, J. Biol. Chem. 267:16007-16010). Using this technique, libraries of up
to 10^12 different antibodies have been expressed on the surface of fd filamentous phage,
creating a "single pot" in vitro immune system of antibodies available for the selection of
monoclonal antibodies (Griffiths et al., 1994, EMBO J. 13:3245-3260). Selection of
antibodies from such libraries can be done by techniques known in the art, including
contacting the phage to immobilized target protein, selecting and cloning phage bound to the
target, and subcloning the sequences encoding the antibody variable regions into an
appropriate vector expressing a desired antibody format.

According to the invention, techniques described for the production of single chain
antibodies (U.S. patent 4,946,778) can be adapted to produce single chain antibodies specific
to the target protein. An additional embodiment of the invention utilizes the techniques
described for the construction of Fab expression libraries (Huse et al., 1989, Science 246:
1275-1281) to allow rapid and easy identification of monoclonal Fab fragments with the
desired specificity for the target protein.

Antibody fragments that contain the idiotypes of the target protein can be generated
by techniques known in the art. For example, such fragments include, but are not limited to:
the F(ab')2 fragment which can be produced by pepsin digestion of the antibody molecule;
the Fab' fragments that can be generated by reducing the disulfide bridges of the F(ab')2
fragment; the Fab fragments that can be generated by treating the antibody molecule with
papain and a reducing agent; and Fv fragments.

In the production of antibodies, screening for the desired antibody can be
accomplished by techniques known in the art, e.g., ELISA (enzyme-linked immunosorbent
assay). To select antibodies specific to a target protein, one may assay generated
hybridomas or a phage display antibody library for an antibody that binds to the target
protein.

5.8.5. METHODS OF MODIFYING PROTEIN ACTIVITIES

Methods of directly modifying protein activities include, inter alia, dominant
negative mutations, specific drugs (used in the sense of this application) or chemical
moieties generally, and also the use of antibodies, as previously discussed.

Dominant negative mutations are mutations to endogenous genes or mutant
exogenous genes that when expressed in a cell disrupt the activity of a targeted protein
species. Depending on the structure and activity of the targeted protein, general rules exist
that guide the selection of an appropriate strategy for constructing dominant negative
mutations that disrupt activity of that target (Herskowitz, 1987, Nature 329:219-222). In
the case of active monomeric forms, over expression of an inactive form can cause
competition for natural substrates or ligands sufficient to significantly reduce net activity of
the target protein. Such over expression can be achieved by, for example, associating a
promoter, preferably a controllable or inducible promoter, of increased activity with the
mutant gene. Alternatively, changes to active site residues can be made so that a virtually
irreversible association occurs with the target ligand. Such can be achieved with certain
tyrosine kinases by careful replacement of active site serine residues (Perlmutter et al., 1996,
Current Opinion in Immunology 8:285-290).

In the case of active multimeric forms, several strategies can guide selection of a
dominant negative mutant. Multimeric activity can be controllably decreased by expression
of genes coding exogenous protein fragments that bind to multimeric association domains
and prevent multimer formation. Alternatively, controllable over expression of an inactive
protein unit of a particular type can tie up wild-type active units in inactive multimers, and
thereby decrease multimeric activity (Nocka et al., 1990, The EMBO J. 9:1805-1813). For
example, in the case of dimeric DNA binding proteins, the DNA binding domain can be
deleted from the DNA binding unit, or the activation domain deleted from the activation
unit. Also, in this case, the DNA binding domain unit can be expressed without the domain
causing association with the activation unit. Thereby, DNA binding sites are tied up without
any possible activation of expression. In the case where a particular type of unit normally
undergoes a conformational change during activity, expression of a rigid unit can inactivate
resultant complexes. For a further example, proteins involved in cellular mechanisms, such
as cellular motility, the mitotic process, cellular architecture, and so forth, are typically
composed of associations of many subunits of a few types. These structures are often highly
sensitive to disruption by inclusion of a few monomeric units with structural defects. Such
mutant monomers disrupt the relevant protein activities and can be controllably expressed in
a cell.

In addition to dominant negative mutations, mutant target proteins that are sensitive
to temperature (or other exogenous factors) can be found by mutagenesis and screening
procedures that are well-known in the art.

Also, one of skill in the art will appreciate that expression of antibodies binding and
inhibiting a target protein can be employed as another dominant negative strategy.

Finally, activities of certain target proteins can be controllably altered by exposure to
exogenous drugs or ligands. In a preferable case, a drug is known that interacts with only
one target protein in the cell and alters the activity of only that one target protein. Graded
exposure of a cell to varying amounts of that drug thereby causes graded perturbations of
cellular states originating at that protein. The alteration can be either a decrease or an
increase of activity. Less preferably, a drug is known and used that alters the activity of only
a few (e.g., 2-5) target proteins with separate, distinguishable, and non-overlapping effects.
Graded exposure to such a drug causes graded perturbations to the several cellular states
originating at the target proteins.

6. EXAMPLES

The following examples are presented by way of illustration of the previously
described invention and are not limiting of that description.

6.1. EXAMPLE 1: CLUSTERING GENESETS BY COREGULATION
This example illustrates one embodiment of the clustering method of the invention.

6.1.1. MATERIALS AND METHODS

Transcript measurement:
Yeast (Saccharomyces cerevisiae, Strain YPH499, see, Sikorski and Hieter, 1989, A
system of shuttle vectors and yeast host strains designed for efficient manipulation of DNA
in Saccharomyces cerevisiae, Genetics 122:19-27) cells were grown in YAPD at 30°C to an
OD600 of 1.0 (±0.2), and total RNA prepared by breaking cells in phenol/chloroform and
0.1% SDS by standard procedures (Ausubel et al., 1995, Current Protocols in Molecular
Biology, Greene Publishing and Wiley-Interscience, New York, Ch. 13). Poly(A)+ RNA
was selected by affinity chromatography on oligo-dT cellulose (New England Biolabs)
essentially as described in Sambrook et al. (Molecular Cloning - A Laboratory Manual (2nd
Ed.), Vol. 1, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, 1989). First
strand cDNA synthesis was carried out with 2 µg poly(A)+ RNA and SuperScript™ II
reverse transcriptase (Gibco-BRL) according to the manufacturer's instructions with the
following modifications. Deoxyribonucleotides were present at the following
concentrations: dA, dG, and dC at 500 µM each, dT at 100 µM and either Cy3-dUTP or
Cy5-dUTP (Amersham) at 100 µM. cDNA synthesis reactions were carried out at 42-44°C
for 90 minutes, after which RNA was degraded by the addition of 2 units of RNAse H, and
the cDNA products were purified by two successive rounds of centrifugation dialysis using
MICROCON-30 microconcentrators (Amicon) according to the manufacturer's
recommendations.

Double-stranded DNA polynucleotides corresponding in sequence to each ORF in
the S. cerevisiae genome encoding a polypeptide greater than 99 amino acids (based on the
published yeast genomic sequence, e.g., Goffeau et al., 1996, Science 274:546-567) are
made by polymerase chain reaction (PCR) amplification of yeast genomic DNA. Two PCR
primers are chosen internal to each of the ORFs according to two criteria: (i) the amplified
fragments are 300-800 bp and (ii) none of the fragments have a section of more than 10
consecutive nucleotides of sequence in common. Computer programs are used to aid in the
design of the PCR primers. Amplification is carried out in 96 well microtitre plates. The
resulting DNA fragments are printed onto glass microscope slides using the method of
Shalon et al., 1996, Genome Research 6:639-645.

Fluorescently-labeled cDNAs (2-6 µg) are resuspended in 4X SSC plus 1 µg/µl
tRNA as carrier and filtered using 0.45 µm filters (Millipore, Bedford, MA). SDS is added
to 0.3%, prior to heating to 100°C for 2 minutes. Probes are cooled and immediately
hybridized to the microarrays produced as described in Example 6.2 for 4 hours at 65°C.
Non-hybridized probe is removed by washing in 1X SSC plus 0.1% SDS at ambient
temperature for 1-2 minutes. Microarrays are scanned with a fluorescence laser-scanning
device as previously described (Schena et al., 1995, Science 270:467-470; Schena et al.,
1996, Proc. Natl. Acad. Sci. 93:10539-11286) and the results (including the positions
of perturbations) are recorded.

Perturbations: This example involved 18 experiments including different drug treatments
and genetic mutations related to a yeast S. cerevisiae biochemical pathway homologous to
immunosuppression in humans. Two drugs, FK506 and Cyclosporin A, were used at the
concentrations of 50 µg/ml or 1 µg/ml in combination with various gene deletions. Genes
CNA1 and CNA2 encode the catalytic subunits of calcineurin (Cardenas et al., 1994, Yeast
as model T cells, in Perspectives in Drug Discovery and Design, 2:103-126). The 18
different experiment conditions are listed in Table 1:



 1    +/- FK506 (50 µg/ml)
 2    +/- FK506 (1 µg/ml)
 3    -CPH1 +/- FK506 (50 µg/ml)
 4    -CPH1 +/- FK506 (1 µg/ml)
 5    -FPR +/- FK506 (50 µg/ml)
 6    -FPR +/- FK506 (1 µg/ml)
 7    -CNA1, CNA2 +/- FK506 (50 µg/ml)
 8    -CNA1, CNA2 +/- FK506 (1 µg/ml)
 9    -GCN4 +/- FK506 (50 µg/ml)
10    -CNA1, CNA2, FPR +/- FK506 (50 µg/ml)
11    -CNA1, CNA2, FPR +/- FK506 (1 µg/ml)
12    -GCN4 +/- Cyclosporin A (50 µg/ml)
13    -FPR +/- Cyclosporin A (50 µg/ml)
14    +/- Cyclosporin A (50 µg/ml)
15    -CNA1, CNA2, CPH1 +/- Cyclosporin A (50 µg/ml)
16    -CNA1, CNA2 +/- Cyclosporin A (50 µg/ml)
17    -CPH1 +/- Cyclosporin A (50 µg/ml)
18    -/+ CNA1, CNA2



Cluster analysis: The set of more than 6000 measured mRNA levels was first reduced to 48
by selecting only those genes which had a response amplitude of at least a factor of 4 in at
least one of the 18 experiments. The initial selection greatly reduced the effect of
measurement errors, which dominated the small responses of most genes in most
experiments.

Clustering using the hclust routine was performed on the resulting data table 18
(experiments) x 48 (genes). The code was run using S-Plus 4.0 on a workstation.
The distance was 1 - r, where r is the correlation coefficient (normalized
dot product). Statistical significance of each branch node was computed using the Monte
Carlo procedure described previously herein. One hundred realizations of permuted data
were clustered to derive an empirical improvement (in compactness) score for each
bifurcation. The score for the unpermuted data is then expressed in standard deviations, and
the resulting significance values are indicated on the tree of FIG. 6.
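
The gene selection and distance measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the S-Plus code used in the example: the function names are hypothetical, responses are assumed to be stored as log10 expression ratios, and a gene with no response at all is assigned a distance of 1 by convention.

```python
import math


def select_responsive_genes(log_ratios, min_fold=4.0):
    """Keep genes whose response reaches min_fold (up or down) in any experiment.

    log_ratios maps a gene name to its list of log10 expression ratios,
    one entry per experiment (an assumed storage format).
    """
    threshold = math.log10(min_fold)
    return {
        gene: values
        for gene, values in log_ratios.items()
        if any(abs(v) >= threshold for v in values)
    }


def correlation_distance(x, y):
    """Return 1 - r, where r is the normalized dot product of x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0.0:
        return 1.0  # assumed convention: no response means zero correlation
    return 1.0 - dot / norm


# Toy data: three genes measured in three experiments (illustrative values only).
toy = {
    "geneA": [0.7, 0.0, -0.7],
    "geneB": [1.4, 0.0, -1.4],   # same pattern as geneA, larger amplitude
    "geneC": [0.1, -0.1, 0.05],  # never reaches a fourfold change
}
selected = select_responsive_genes(toy)
distance_ab = correlation_distance(toy["geneA"], toy["geneB"])
```

Because the distance depends only on the shape of the response and not its amplitude, geneA and geneB in the toy data are at a distance of essentially zero, which is the behavior a co-regulation measure needs.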

6.1.2. RESULTS AND DISCUSSION

FIG. 6 shows the clustering tree derived from the 'hclust' algorithm operating on the
18x48 data table. The 48 genes were clustered into various branches. The vertical
coordinate at the horizontal connector joining two branches indicates the distance between
branches. Typical values are in the range of 0.2-0.4 where 0 is perfect correlation and 1 is
zero correlation. The number at the branch is the statistical significance. Numbers greater
than about 2 indicate that the branching is significant at the 95% confidence level.
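
The significance numbers attached to each branch can be illustrated with a small permutation test. The sketch below is an assumption-laden illustration rather than the procedure of the specification: `score` is a hypothetical caller-supplied function standing in for the improvement-in-compactness score (which is defined elsewhere in the specification), and each gene's responses are shuffled across experiments independently before the score is recomputed.

```python
import math
import random
import statistics


def normalized_dot(x, y):
    """Correlation coefficient computed as a normalized dot product."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def permutation_significance(profiles, score, n_realizations=100, seed=0):
    """Express score(profiles) in standard deviations of its permuted distribution.

    profiles is a list of per-gene response lists; score is any function
    mapping such a list to a number (hypothetical stand-in for the
    compactness-improvement score).
    """
    rng = random.Random(seed)
    actual = score(profiles)
    permuted_scores = []
    for _ in range(n_realizations):
        # Shuffle the experiment index independently for every gene.
        shuffled = [rng.sample(row, len(row)) for row in profiles]
        permuted_scores.append(score(shuffled))
    mean = statistics.mean(permuted_scores)
    spread = statistics.stdev(permuted_scores)
    return (actual - mean) / spread


# Two perfectly co-regulated toy genes across 12 experiments.
pattern = [-6, -5, -4, -3, -2, -1, 1, 2, 3, 4, 5, 6]
toy_profiles = [pattern, [2 * v for v in pattern]]
significance = permutation_significance(
    toy_profiles, lambda rows: normalized_dot(rows[0], rows[1])
)
```

With one hundred realizations, as in the example, a value above about 2 corresponds to the 95% confidence level quoted for the tree.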

The horizontal line of FIG. 6 is the cut off level for defining genesets. This level is
arbitrarily set. Those branches with two or fewer members were ignored for further analysis.
Three genesets with three or more members were defined at this cut off level. The
significance values (in standard deviations) shown at the branch nodes were derived as
described, and show that the three branches are truly distinct. The clusters correspond to the
pathways involving the calcineurin protein, the PDR gene and the Gcn4 transcription factor,
which indicates that cluster analysis is capable of producing genesets that have
corresponding genetic regulation pathways. See, Marton et al., Drug target validation and
identification of secondary drug target effects using DNA microarrays, Nature Medicine
4:1293-1301.

6.2. EXAMPLE 2: ENHANCING DETECTION OF RESPONSE PATTERN USING
GENESET AVERAGE RESPONSE
This example illustrates enhanced detection of a particular response pattern by
geneset averaging.

Geneset number 3 in the clustering analysis result of FIG. 6 involves genes regulated
by the Gcn4 transcription factor. This was verified via the literature and via multiple
sequence alignment analysis of the regulatory regions 5' to the individual genes (Stormo and
Hartzell, 1989, Identifying protein binding sites from unaligned DNA fragments, Proc. Natl.
Acad. Sci. 86:1183-1187; Hertz and Stormo, 1995, Identification of consensus patterns in
unaligned DNA and protein sequences: a large-deviation statistical basis for penalizing gaps,
in Proc. of 3rd Intl. Conf. on Bioinformatics and Genome Research, Lim and Cantor, eds., World
Scientific Publishing Co., Ltd., Singapore, pp. 201-216). Twenty of^2 genes in geneset 3
had a common promoter sequence appropriate to Gcn4. These 20 were used to define a
geneset. Response profiles to a titration series of the drug FK506, which is known to hit this
pathway at higher concentrations, were projected onto this geneset. The resulting projected
profile elements are labeled 'Geneset' in Table 2, where the responses are given in standard
deviations. The 'Geneset' response becomes very significant (>3 sigma) at 1.6 µg/ml.
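
Why averaging over a geneset sharpens detection can be sketched numerically. The code below is a simplified illustration, not the projection formula of the specification: it assumes every gene response is already expressed in units of its own standard error, that the errors are independent, and that missing values (NaN) are simply skipped. Under those assumptions the mean of n genes has a standard error of 1/sqrt(n), so a modest but consistent response across many genes becomes highly significant. The function names are hypothetical, and the sketch is not claimed to reproduce the Table 2 values exactly.

```python
import math


def geneset_mean(responses):
    """Unweighted mean response of a geneset, ignoring missing (NaN) values."""
    valid = [r for r in responses if not math.isnan(r)]
    return sum(valid) / len(valid)


def geneset_significance(responses):
    """Significance of the geneset mean, in standard deviations.

    Assumes each entry is in units of its own standard error and that the
    errors are independent, so the mean of n entries has standard error
    1/sqrt(n) and the significance is sum / sqrt(n).
    """
    valid = [r for r in responses if not math.isnan(r)]
    return sum(valid) / math.sqrt(len(valid))


# Toy geneset: four genes each responding at 2 sigma, one missing measurement.
toy_responses = [2.0, float("nan"), 2.0, 2.0, 2.0]
toy_mean = geneset_mean(toy_responses)
toy_significance = geneset_significance(toy_responses)
```

In the toy case no single gene exceeds 2 sigma, yet the geneset as a whole reaches 4 sigma, which mirrors the qualitative behavior described for the 'Geneset' row.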

Table 2

                           Concentration (µg/ml)
Gene             0.1       0.31      1.6       7.5       16        50
YBR047W          0.0781    0.1553    0.2806    1.1596    3.3107    4.248
YER024W          0.1985   -0.0419    0.4868    1.1526    4.6342    5.8934
ARG5,6           0.1162    0.2722    1.1844    2.7433    6.0457    5.2406
YGL117W          0.6309    0.6768    1.6276    2.699     4.9827    5.9066
YGL184C          0.0654   -0.0207   -0.0731   -0.4586    2.7166    5.3106
ARG4             0.3585    0.3508    1.6674    3.2973    4.5135    5.8858
YHR029C         -0.031     0.2438    0.4421    2.3813    5.0446    5.5781
HIS5             0.0292    0.2175    0.9802    2.8414    6.0052    4.9557
CPA2             NaN       NaN       1.2429    NaN       4.1093    4.0958
SNO1            -0.2899    0.0244   -0.4772    2.538     5.8877    5.5665
SNZ1            -0.7223    0.0244   -0.4772    2.538     5.8877    5.5665
YMR195W          0.7615    0.3323    1.6021    0.8879    4.0983    4.6141
NCE3             0.0371    0.1668    1.2896    1.569     5.5819    3.3928
ARG1             0.2083    0.3436    3.1765    4.2215    4.711     5.7996
HIS3            -0.3719    0.1282    0.71      1.8024    4.6461    5.2637
SSU1             0.6257    0.6655    0.2883    0.5059    4.6461    3.5782
MET16            0.0225   -0.6269   -0.1885    0.1621    3.3857    4.855
ECM13            0.1269    0.2197    0.5226    2.5343    4.8689    3.1882
ARO3             NaN      -0.1371    0.2684    0.6059    4.0553    5.7035
PCL5             0.1418    0.2767    0.4127    2.2898    5.4688    5.2339
Geneset Average  0.1728    0.6753    3.3045    7.8209   19.9913   21.3315

6.3. EXAMPLE 3: IMPROVED CLASSIFICATION OF DRUG ACTIVITY
The 18-experiment data set mentioned in Example 1, supra, was combined with an
additional 16 experiments using a variety of perturbations including immunosuppressive
drugs FK506 and Cyclosporin A, and mutations in genes relevant to the activity of those
drugs; and drugs unrelated to immunosuppression: hydroxyurea, 3-Aminotriazole, and
methotrexate. The experimental conditions are listed in Table 3:

Table 3. Additional 16 Experiments

 1    3-Aminotriazole (0.01 mM)
 2    3-Aminotriazole (1 mM)
 3    3-Aminotriazole (10 mM)
 4    3-Aminotriazole (100 mM)
 5    Hydroxyurea (1.6 mM)
 6    Hydroxyurea (3.1 mM)
 7    Hydroxyurea (6.2 mM)
 8    Hydroxyurea (12.5 mM)
 9    Hydroxyurea (25 mM)
10    Hydroxyurea (50 mM)
11    Methotrexate (3.1 µM)
12    Methotrexate (6 µM)
13    Methotrexate (25 µM)
14    Methotrexate (50 µM)
15    Methotrexate (100 µM)
16    Methotrexate (200 µM)



A cluster analysis was performed with the combined data set. A first down selection
of genes was done by requiring the genes to have a significant response in 4 or more of the
34 experiments, where this threshold was defined precisely as greater than twofold up- or
down-regulation, and a confidence level of 99% or better. This selection yielded 194 genes.
Less stringent thresholds would yield more genes and a higher incidence of measurement
errors contaminating the data and confusing the biological identifications of the genesets;
however, the final results are not very sensitive to this threshold.

The 'hclust' procedure of S-Plus was used, giving the clustering tree shown in FIG.
7. There are 16 genesets at the cut level D = 0.4 shown in FIG. 7. Of these 16, 7 consist of
two genes or less. Discarding these small clusters leaves 9 major clusters marked as shown
in FIG. 7 with numbers 1-9. All the resulting bifurcations above the cut level are significant
(more than two sigma; see numbers at each node), so the clusters are truly distinct.
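
The down-selection rule stated above (more than twofold regulation at 99% confidence or better, in at least 4 of the 34 experiments) can be written compactly. The following is a sketch under assumptions: the function name and the data layout, a list of (log10 ratio, confidence) pairs per gene, are hypothetical and chosen only for illustration.

```python
import math


def down_select(measurements, min_experiments=4, min_fold=2.0, min_confidence=0.99):
    """Return genes with a significant response in at least min_experiments.

    measurements maps a gene name to a list of (log10_ratio, confidence)
    pairs, one per experiment (an assumed data layout).
    """
    threshold = math.log10(min_fold)
    selected = []
    for gene, results in measurements.items():
        hits = sum(
            1
            for log_ratio, confidence in results
            if abs(log_ratio) > threshold and confidence >= min_confidence
        )
        if hits >= min_experiments:
            selected.append(gene)
    return selected


# Toy data for three genes over a handful of experiments (illustrative only).
toy_measurements = {
    "g1": [(0.5, 0.995)] * 4 + [(0.0, 0.5)] * 2,    # 4 significant responses
    "g2": [(0.5, 0.995)] * 3 + [(0.5, 0.90)] * 3,   # only 3 pass the confidence cut
    "g3": [(-0.4, 0.999)] * 5,                      # 5 significant down-regulations
}
kept = down_select(toy_measurements)
```

Lowering `min_experiments` or `min_fold` admits more genes at the cost of more noise-dominated entries, which is the trade-off the text describes.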

It is noteworthy that genesets defined by the immunosuppressive drug pathways are
again identified here even though non-immunosuppressive drug response data are combined
in the analysis.

Geneset 2 contains the calcineurin dependent genes from Geneset 1 of FIG. 6, while
Geneset 4 contains the Gcn4-dependent genes from Geneset 3 of FIG. 6.

The response to FK506 at 16 µg/ml was obtained and the response profile was used
as the "unknown" profile. The response profile was projected into the genesets defined using
the cluster analysis of the 34 experiments. The 34 profiles from the individual experiments
from the clustering set also were projected onto the basis.

The projected profile for FK506 at 16 µg/ml was compared with each of the 34
projected profiles from the clustering set. Five of these comparisons are illustrated in FIGs.
8A-8E, and will be discussed in more detail below.

The correlation between the projected profile of the unknown, and the projected
profile of each of the 34 training experiments was calculated using Equation 10 (Section
5.4.2, supra) and is displayed as circles (-0-) in FIG. 9.

Also displayed for comparison are the correlation coefficients computed without
projection (-a-), and without projection but with restriction to those genes that were up- or
down-regulated at the 95% confidence level, and by at least a factor of two, in one or the
other of the two profiles (-0-).

In general, the projected correlation coefficients track the unprojected ones, and show
larger values. The larger values are a consequence of the averaging out of measurement
errors which occurs during projection onto the genesets. These individual measurement
errors tend to bias the unprojected correlation coefficients low, and this bias is reduced by
the projection operation.
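
The bias described in the preceding paragraph can be reproduced with a toy calculation. The sketch below makes several assumptions that are not taken from the specification: projection is modeled as an unweighted average over the member genes of each geneset, the correlation is taken to be a normalized dot product (Equation 10 itself is not reproduced in this section), and the gene names and noise values are invented so that the arithmetic can be checked by hand.

```python
import math


def project(profile, genesets):
    """Project a gene-level profile onto genesets by averaging member genes.

    profile maps a gene name to its response; genesets is a list of lists
    of gene names (hypothetical layout).
    """
    return [sum(profile[g] for g in members) / len(members) for members in genesets]


def correlation(x, y):
    """Correlation coefficient computed as a normalized dot product."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


genes = ["a1", "a2", "a3", "a4", "b1", "b2", "b3", "b4"]
genesets = [genes[:4], genes[4:]]

# Shared underlying response: geneset A up, geneset B down.
signal = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
# Independent measurement noise in each experiment (invented values).
noise1 = [0.5, -0.5, 0.5, -0.5] * 2
noise2 = [0.5, 0.5, -0.5, -0.5] * 2

profile1 = {g: s + n for g, s, n in zip(genes, signal, noise1)}
profile2 = {g: s + n for g, s, n in zip(genes, signal, noise2)}

r_unprojected = correlation(
    [profile1[g] for g in genes], [profile2[g] for g in genes]
)
r_projected = correlation(project(profile1, genesets), project(profile2, genesets))
```

In this constructed case the noise averages out exactly within each geneset, so the projected correlation recovers the shared signal while the unprojected correlation is pulled below it. Real data would show the same tendency only approximately.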

The correlation coefficient of the projected profiles tends to have large errors when
the original profile response was very weak and noise-dominated. Such is the case at some
of the lower concentrations of drug treatment, including Experiments 1, 2, 7, 8. In Experiment
2, for example, there is a projected correlation coefficient of negative 0.45, where the
unprojected correlations are close to zero. This is a consequence of noise dominance of the
correlation coefficient. FIG. 8A shows that treatment with HU at 3.1 mM (gray bars) has a
very weak projected profile.

FIG. 8B gives the elements of the projected profiles for the comparison of FK506 at
16 µg/ml (the unknown) with Experiment No. 25 in FIG. 9, FK506 at 50 µg/ml. The
projected profiles are highly consistent with the very high correlation values in FIG. 9. The
largest response is in Geneset 7, which corresponds biologically to an amino acid starvation
response evidently triggered at large concentrations of the drug. The response in Geneset 5
is mediated via the primary target of the drug, the calcineurin protein. This response is still
present at lower concentrations of the drug (FIG. 8C, gray bars, FK506 at 1 µg/ml), while
the response in Geneset 7 and other Genesets is greatly reduced. This biological
interpretation is an immediate aid in classification of drug activity. It can be concluded that
the higher concentration of the drug has triggered secondary (probably undesirable)
pathways. One of the primary mediators of these pathways turns out to be the transcription
factor Gcn4, as shown by the grey profile in FIG. 8D from Experiment 34 listed in FIG. 8A.
Here, the activity in Genesets 2, 3, and 7 is removed by the deletion of the GCN4 gene.

However, blind classification using the projected profiles also is improved. Note that
the projected correlation coefficients show that the next-nearest neighbor to the unknown is
the experiment two rows above the best match, '-cph +/- FK506 at 50 µg/ml'. This is
treatment with the drug of cells genetically deleted for the gene CPH1. This gene is not
essential to the activity of FK506, and should not greatly change the response. Thus the
projected profile correctly shows a high similarity with the unknown, FK506 at 16 µg/ml.
The unprojected correlation coefficients, however, declare the experiment six rows above the
best match, '-cna +/- FK506 at 50 µg/ml', to be the second best match. This experiment
involves treatment with the drug of cells genetically deleted for the primary target,
calcineurin. In this case, the response to Geneset 5, mediated by calcineurin, has
disappeared (see FIG. 8E) while the other responses remain. This important biological
difference is reflected in the projected elements of FIG. 8E and in the projected correlation
coefficients, but not in the unprojected correlation coefficients. Thus conclusions about
biological similarity would be more reliable in this case based on the projected correlation
coefficients using the method of the invention than based on unprojected methods.

6.4. EXAMPLE 4: IMPROVED CLASSIFICATION OF
BIOLOGICAL RESPONSE PROFILES
The 34-experiment data set described in Example 3 (Section 6.3, supra) was also
analyzed by two-dimensional cluster analysis. In particular, cluster analysis was first
performed with the data set to identify genesets as described in Example 3, supra. Next, the
'hclust' procedure of S-Plus was used again, this time to organize the biological response
profiles according to the similarity of the biological response.

The results of this analysis are illustrated in Fig. 16. Fig. 16A shows a gray scale
display of the plurality of reduced genetic transcripts (horizontal axis) measured in the 34
experiments (vertical axis). Thus, each row in Fig. 16A indicates the response of genetic
transcripts to a particular perturbation (e.g., exposure to a particular drug). The gray scale
represents the logarithm of measured expression ratio as shown in the gray scale bar on the
top of Fig. 16. Specifically, black denotes up regulation of a transcript (+1), whereas white
denotes down regulation (-1), and the middle gray scale (0) denotes no change in expression.
Fig. 16B illustrates the co-regulation tree that groups the genetic transcripts (i.e., the columns
in Fig. 16A) into the genesets described in Example 3, supra. The column index order
represented in this co-regulation tree was then used to re-order the columns in Fig. 16A to
generate the display shown in Fig. 16C. The same clustering algorithm was then applied to
the rows in Fig. 16C (i.e., to the response profiles), and the row index was similarly
re-ordered to generate Fig. 16D.
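
The two re-ordering steps (columns first, then rows) amount to permuting the indices of the data matrix. The sketch below illustrates only that bookkeeping. It assumes cluster membership is available as one integer label per column or row, which is a simplification: in the example the order is read from the leaves of the 'hclust' tree. All names are hypothetical.

```python
def order_by_cluster(labels):
    """Return indices sorted so that members of the same cluster are adjacent.

    labels holds one cluster label per row or column (assumed input); ties
    keep their original relative order because sorted() is stable.
    """
    return sorted(range(len(labels)), key=lambda i: labels[i])


def reorder(matrix, row_order, column_order):
    """Return a copy of matrix with rows and columns permuted as given."""
    return [[matrix[r][c] for c in column_order] for r in row_order]


# Toy data: 3 experiments (rows) x 4 transcripts (columns), log ratios.
toy_matrix = [
    [1, -1, 1, -1],
    [0, 0, 0, 0],
    [1, -1, 1, -1],
]
column_labels = [1, 0, 1, 0]  # transcripts 0 and 2 are co-regulated
row_labels = [1, 0, 1]        # experiments 0 and 2 respond alike

column_order = order_by_cluster(column_labels)
row_order = order_by_cluster(row_labels)
reordered = reorder(toy_matrix, row_order, column_order)
```

After re-ordering, co-regulated transcripts sit in adjacent columns and similar experiments in adjacent rows, which is what makes the vertical and horizontal striping visible in a gray scale display.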

Comparing Figs. 16A and 16D, large structures are readily evident after the
reordering. Not only can genesets be readily identified from vertical striping in Fig. 16D, but
sets of experiments associated with the activation of particular genesets are also identified
from horizontal striping in Fig. 16D. Fig. 17 gives a more detailed view of Fig. 16D, and
details the experiment assignments and some of the geneset assignments in the re-ordered
form of Fig. 16D. For example, the 'CNA' vertical stripe indicated in Fig. 17 is the
calcineurin-dependent geneset, which is affected (i.e., transcription repressed) by all the
experiments involving immunosuppressive drugs in cells except those where the intermediate
targets of the drug, or calcineurin itself, have been removed with mutations. The experiments
contributing to the large horizontal stripe all activate sets of genesets which are mostly
Gcn4-dependent. This is particularly evident when these experiments are compared with the
top two rows of Fig. 17, which comprise experiments wherein Gcn4 has been deleted.

6.5. EXAMPLE 5: PROJECTING OUT PROFILE ARTIFACTS
Two sets of experiments were performed according to the reverse transcription
procedure described in Example 1 (Section 6.1.1, supra) where the effect of deletion of the
YJL107C gene was measured. In one of the two experiments, RNA concentration in the
procedure was (intentionally) poorly controlled, thereby generating response profile data that
was contaminated by artifacts. The correlation between the two profiles, determined by
Equation 7, is shown in Fig. 18. Asterisk symbols (*) indicate those transcripts which were
up- or down-regulated in either of the two experiments at a confidence level of 90% or more.
The correlation coefficient between the two experiments is 0.82.

An artifact template, characterizing the effect of poor control of RNA concentration in
a reverse transcription procedure, was generated by measuring transcript levels in S.
cerevisiae wherein the RNA concentration was intentionally varied. Thus, a response profile
was obtained wherein the "perturbation" was, in fact, the variation of RNA concentration in
the reverse transcription procedure. This template is plotted in Fig. 19 as gene expression
ratio vs. mean expression level. Those transcripts which were up- or down-regulated at the
90% confidence level were labeled with their names and have one-sigma error bars.

The response profile corresponding to the contaminated YJL107C deletion experiment
was cleaned using this artifact template. Specifically, best scaling coefficients were
determined by least squares minimization of Equation 16, and a "cleaned" response profile
was generated with these coefficients according to Equation 17. The correlation between the
"cleaned" YJL107C deletion experiment and the corresponding "uncontaminated" experiment
is shown in Fig. 20. The correlation is improved to 0.87. In the absence of significant
artifacts, other sources of random measurement error commonly limit the correlation between
nominally repeated measurements of profiles to about 0.90. Thus, the improvement from
0.82 to 0.87 represents nearly the maximum amount of improvement that is realistically
possible with any artifact removal technique.
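
The cleaning step can be illustrated for the simplest case of a single artifact template. Equations 16 and 17 are not reproduced in this section, so the sketch below assumes an ordinary least-squares fit: the scaling coefficient a minimizes the sum over transcripts of (p_k - a t_k)^2, where p is the contaminated profile and t is the template, and the cleaned profile is p - a t. The function names and the toy numbers are hypothetical.

```python
def scaling_coefficient(profile, template):
    """Least-squares coefficient a minimizing sum((p_k - a * t_k) ** 2)."""
    numerator = sum(p * t for p, t in zip(profile, template))
    denominator = sum(t * t for t in template)
    return numerator / denominator


def clean_profile(profile, template):
    """Subtract the best-fit multiple of the artifact template from the profile."""
    a = scaling_coefficient(profile, template)
    return [p - a * t for p, t in zip(profile, template)]


# Toy example: a true response contaminated by twice the artifact template.
template = [1.0, -1.0, 1.0, -1.0]
true_response = [1.0, 1.0, -1.0, -1.0]  # chosen orthogonal to the template
contaminated = [r + 2.0 * t for r, t in zip(true_response, template)]

coefficient = scaling_coefficient(contaminated, template)
cleaned = clean_profile(contaminated, template)
```

The toy response was chosen to be orthogonal to the template, so the fit removes the artifact exactly. When a real response overlaps the template, part of the response is removed as well, which is one reason cleaning cannot push the correlation past the limit set by random measurement error.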

7. REFERENCES CITED
All references cited herein are incorporated herein by reference in their entirety and
for all purposes to the same extent as if each individual publication or patent or patent
application was specifically and individually indicated to be incorporated by reference in its
entirety for all purposes.

Many modifications and variations of this invention can be made without departing
from its spirit and scope, as will be apparent to those skilled in the art. The specific
embodiments described herein are offered by way of example only, and the invention is to be
limited only by the terms of the appended claims, along with the full scope of equivalents to
which such claims are entitled.

WHAT IS CLAIMED IS:

1. A method for analyzing a biological sample comprising converting a first profile of a
plurality of measurements of cellular constituents in said biological sample into a projected
profile containing a plurality of cellular constituent set values according to a definition of
co-varying basis cellular constituent sets, wherein said definition is based upon the co-variation
of said cellular constituents under a plurality of different perturbations, and wherein said
converting comprises projecting said first profile onto said basis cellular constituent sets.

2. The method of claim 1, wherein the plurality of different perturbations comprises at
least five different perturbations.

3. The method of claim 2, wherein the plurality of different perturbations comprises
more than ten different perturbations.

4. The method of claim 3, wherein the plurality of different perturbations comprises
more than 50 different perturbations.

5. The method of claim 4, wherein the plurality of different perturbations comprises
more than 100 different perturbations.

6. The method of claim 1 further comprising the step of indicating the state of said
biological sample with said projected profile.

7. The method of claim 1 further comprising the steps of comparing said projected
profile with a reference projected profile, and indicating similarity or difference between said
projected profile and said reference profile.

8. The method of claim 1, wherein said definition is based upon the co-variation of said
cellular constituents under a plurality of different perturbations.

9. The method of claim 8 wherein said definition is defined by a similarity tree derived 
by a clust« analysis of said cellular constituents under said pluraUty of pertmbations. . 

10. The method of claim 9 wherein said cellular constituent sets are defined as branches
of said similarity tree.

11. The method of claim 10 wherein said branches are selected by applying a cutting level
across said tree, wherein said cutting level is determined by expected number of biological
pathways represented by said cellular constituents.
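
As an illustration of claims 9 to 11, the following is a minimal sketch of building a similarity tree and cutting it at a level that leaves an expected number of branches. The correlation distance, the average-linkage rule, the function names, and the example responses are all assumptions chosen for the illustration; the claims do not prescribe a particular metric or linkage.

```python
from itertools import combinations
from math import sqrt

def correlation_distance(x, y):
    """Return 1 - Pearson correlation of two response vectors (assumed metric)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / sqrt(sxx * syy)

def cut_tree(responses, n_sets):
    """Merge constituents bottom-up (average linkage) until n_sets branches remain.

    responses: dict mapping a constituent name to its responses across perturbations.
    n_sets: expected number of pathways, used here as the cutting level.
    """
    clusters = [[name] for name in responses]

    def linkage(a, b):
        pairs = [(i, j) for i in a for j in b]
        total = sum(correlation_distance(responses[i], responses[j]) for i, j in pairs)
        return total / len(pairs)

    while len(clusters) > n_sets:
        a, b = min(combinations(clusters, 2), key=lambda ab: linkage(*ab))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return [sorted(c) for c in clusters]

# Hypothetical responses of four genes to four perturbations.
responses = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.1, 6.0, 8.2],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.1, 3.9, 2.0],
}
genesets = sorted(cut_tree(responses, 2))
```

With these made-up responses, the two rising genes and the two falling genes end up in separate branches. The sketch does not handle a constituent with constant response, for which the correlation is undefined.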

12. The method of claim 10 wherein distinction among said branches achieves a statistical
significance at 95% confidence level.

13. The method of claim 12 wherein said statistical significance is evaluated with a test
using Monte Carlo randomization of an index of said perturbations.

14. The method of claim 13 wherein the test using Monte Carlo randomization comprises:

(a) determining an actual fractional improvement in cluster analysis of said
cellular constituents;

(b) generating permuted cellular constituents by means of Monte Carlo
randomization of each perturbation for each cellular constituent;

(c) performing cluster analysis on the permuted cellular constituents;

(d) determining the fractional improvements in the cluster analysis of the
permuted cellular constituents; and

(e) repeating said steps of generating permuted cellular constituents and
performing cluster analysis on the permuted cellular constituents so that a
distribution of fractional improvements is obtained,

wherein the statistical significance is determined by comparing the actual fractional
improvement to the distribution of fractional improvements.
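
Steps (a) to (e) can be sketched as follows. This is an illustration under stated assumptions, not the test as practiced: the fractional improvement is taken here to be one minus the ratio of the best two-cluster scatter to the one-cluster scatter, the two-cluster split is found by exhaustive search (practical only for a handful of constituents), and the data, trial count, and function names are hypothetical.

```python
import random
from itertools import combinations

def scatter(rows):
    """Sum of squared distances of the rows from their centroid."""
    if not rows:
        return 0.0
    dim = len(rows[0])
    centroid = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
    return sum((r[d] - centroid[d]) ** 2 for r in rows for d in range(dim))

def fractional_improvement(rows):
    """Assumed definition: 1 - (best two-cluster scatter) / (one-cluster scatter)."""
    total = scatter(rows)
    indices = range(len(rows))
    best = total
    for size in range(1, len(rows) // 2 + 1):
        for group in combinations(indices, size):
            chosen = set(group)
            left = [rows[i] for i in chosen]
            right = [rows[i] for i in indices if i not in chosen]
            best = min(best, scatter(left) + scatter(right))
    return 1.0 - best / total

def monte_carlo_significance(rows, n_trials=200, seed=0):
    """Compare the actual improvement with improvements on permuted data."""
    rng = random.Random(seed)
    actual = fractional_improvement(rows)            # step (a)
    hits = 0
    for _ in range(n_trials):                        # step (e)
        # step (b): shuffle the perturbation index separately for each constituent
        permuted = [rng.sample(r, len(r)) for r in rows]
        # steps (c) and (d): cluster the permuted data and score it
        if fractional_improvement(permuted) >= actual:
            hits += 1
    return actual, (hits + 1) / (n_trials + 1)

# Hypothetical data: six constituents (rows) measured under six perturbations.
rows = [
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    [0.9, 0.1, 1.1, 0.0, 1.0, 0.1],
    [1.1, 0.0, 0.9, 0.1, 0.9, 0.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    [0.1, 0.9, 0.0, 1.1, 0.1, 1.0],
    [0.0, 1.1, 0.1, 0.9, 0.0, 0.9],
]
actual, p_value = monte_carlo_significance(rows)
```

A p-value below 0.05 would correspond to the 95% confidence level of claim 12. The add-one correction in the returned p-value is a common convention for permutation tests and is likewise an assumption here.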

15. The method of claim 12 wherein said statistical significance is evaluated with a test
using Monte Carlo randomization of a time index of a biological response to one or more
perturbations.

16. The method of claim 10, 11, or 12, wherein said defined cellular constituent sets are
refined based upon biological relationships among said cellular constituents.

17. The method of claim 1 wherein said definition is:

$$
V = \begin{bmatrix}
V_1^{(1)} & \cdots & V_1^{(n)} \\
\vdots & \ddots & \vdots \\
V_k^{(1)} & \cdots & V_k^{(n)}
\end{bmatrix}
$$

wherein $V_k^{(n)}$ is the contribution of cellular constituent k to cellular constituent set n.

18. The method of claim 17 wherein said step of converting comprises the execution of
the operation:

$$
P = [P_1, \ldots, P_i, \ldots, P_n] = p \cdot V
$$

wherein $P_i$ is cellular constituent set value i and vector p is a profile of cellular constituents.
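
The projection recited in claims 17 and 18 amounts to a vector-matrix product. A minimal sketch follows, assuming the matrix V is stored as one row per cellular constituent and one column per cellular constituent set; the function name and the numbers are hypothetical.

```python
def project(profile, basis):
    """Return P = p . V.

    profile: list of measurements, one per cellular constituent (the vector p).
    basis: list of rows, where basis[k][n] is the contribution of
           constituent k to constituent set n (the matrix V).
    """
    n_sets = len(basis[0])
    return [
        sum(p_k * row[n] for p_k, row in zip(profile, basis))
        for n in range(n_sets)
    ]

# Hypothetical basis: constituents 0 and 1 form set 0, constituents 2 and 3 form set 1.
V = [
    [0.5, 0.0],
    [0.5, 0.0],
    [0.0, 0.5],
    [0.0, 0.5],
]
p = [2.0, 4.0, -1.0, -3.0]
P = project(p, V)  # [3.0, -2.0]
```

Because each column of this example V holds equal weights that sum to one, each projected value is the plain average of the constituents in the set.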

19. The method of claim 1 wherein each of said cellular constituent set values is the
average value of the level of said cellular constituents within a corresponding cellular
constituent set.

20. The method of claim 1 wherein each of said cellular constituent set values is a
weighted average of the level of said cellular constituents within a corresponding cellular
constituent set.
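
The average and weighted average of claims 19 and 20 can be sketched as below. The dictionary layout, the names, and the weights are assumptions for illustration; the claims do not specify how weights are chosen.

```python
def set_values(levels, sets, weights=None):
    """Compute one value per constituent set.

    levels: dict mapping constituent name to its measured level.
    sets: dict mapping set name to the list of its member constituents.
    weights: optional dict mapping constituent name to a weight;
             if omitted, every member counts equally (plain average).
    """
    values = {}
    for name, members in sets.items():
        w = [1.0 if weights is None else weights[m] for m in members]
        values[name] = sum(wi * levels[m] for wi, m in zip(w, members)) / sum(w)
    return values

# Hypothetical levels, set membership, and weights.
levels = {"a": 2.0, "b": 4.0, "c": -1.0, "d": -3.0}
sets = {"S1": ["a", "b"], "S2": ["c", "d"]}
weights = {"a": 3.0, "b": 1.0, "c": 1.0, "d": 1.0}

plain = set_values(levels, sets)              # {"S1": 3.0, "S2": -2.0}
weighted = set_values(levels, sets, weights)  # {"S1": 2.5, "S2": -2.0}
```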


21. The method of claim 1 wherein said plurality of measurements is normalized to a 
unity vector size. 
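
Normalizing a profile to unit vector size can be sketched as dividing by its Euclidean length. The choice of the Euclidean norm is an assumption made here for illustration.

```python
from math import sqrt

def normalize(profile):
    """Scale a profile so that its Euclidean length is 1 (assumed norm)."""
    norm = sqrt(sum(x * x for x in profile))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero profile")
    return [x / norm for x in profile]

unit = normalize([3.0, 4.0])  # approximately [0.6, 0.8]
```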

22. The method of claim 1 wherein said measurements of cellular constituents are 
measurements of responses of said biological sample to a perturbation. 

23. A method for analyzing a biological sample comprising:

(a) converting a first profile of a plurality of measurements of cellular constituents
in said biological sample into a projected profile containing a plurality of
cellular constituent set values according to a definition of co-varying basis
cellular constituent sets, wherein said converting comprises projecting said first
profile onto said basis cellular constituent sets;

(b) comparing said projected profile with a reference profile; and 

(c) indicating similarity or difference between said projected profile and said 
reference profile. 
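
Steps (b) and (c) can be illustrated with a simple similarity measure between projected profiles. The cosine similarity and the 0.9 threshold below are assumptions; the claim does not name a particular measure or cutoff.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Normalized dot product of two projected profiles (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def compare(projected, reference, threshold=0.9):
    """Return the similarity and a label; the threshold is a hypothetical cutoff."""
    s = cosine_similarity(projected, reference)
    return s, ("similar" if s >= threshold else "different")

# Hypothetical projected profiles with two constituent set values each.
same_shape = compare([3.0, -2.0], [6.0, -4.0])   # same pattern, different amplitude
orthogonal = compare([3.0, -2.0], [2.0, 3.0])    # unrelated pattern
```

In this example the first pair is labeled similar because the profiles differ only in amplitude, and the second pair is labeled different because their dot product is zero.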

24. The method of claim 23 wherein said definition is derived from the co-regulation of
said cellular constituents.

25. The method of claim 23 wherein said definition is based upon the co-variation of said
cellular constituents under a plurality of different perturbations.

26. The method of claim 23 wherein said definition is:

$$
V = \begin{bmatrix}
V_1^{(1)} & \cdots & V_1^{(n)} \\
\vdots & \ddots & \vdots \\
V_k^{(1)} & \cdots & V_k^{(n)}
\end{bmatrix}
$$

wherein $V_k^{(n)}$ is the contribution of cellular constituent k to cellular constituent set n.

27. The method of claim 26 wherein said step of converting comprises the execution of
the operation:

$$
P = [P_1, \ldots, P_i, \ldots, P_n] = p \cdot V
$$

wherein $P_i$ is cellular constituent set value i and vector p is a profile of cellular constituents.

28. The method of claim 23 wherein each of said cellular constituent set values is the 
average value of the level of said cellular constituents within a corresponding cellular 
constituent set. 



29. The method of claim 23 wherein each of said cellular constituent set values is a
weighted average of the level of said cellular constituents within a corresponding cellular
constituent set.

30. The method of claim 23 wherein said plurality of measurements is normalized to a
unity vector size.

31. The method of claim 23 wherein said measurements of cellular constituents are
measurements of responses of said biological sample to a perturbation.



32. A method for analyzing a biological sample comprising converting a first profile of a
plurality of measurements of cellular constituents in said biological sample into a projected
profile containing a plurality of cellular constituent set values according to a definition of
co-varying basis cellular constituent sets,
wherein said definition is provided by the expression

$$
V = \begin{bmatrix}
V_1^{(1)} & \cdots & V_1^{(n)} \\
\vdots & \ddots & \vdots \\
V_k^{(1)} & \cdots & V_k^{(n)}
\end{bmatrix}
$$

in which $V_k^{(n)}$ is the contribution of cellular constituent k to cellular constituent set n, and
wherein said converting comprises projecting said first profile onto said basis cellular
constituent sets.

33. The method of claim 32 wherein said step of converting comprises the execution of
the operation:

$$
P = [P_1, \ldots, P_i, \ldots, P_n] = p \cdot V
$$

wherein $P_i$ is cellular constituent set value i and vector p is a profile of cellular constituents.

34. A method for analyzing a biological sample comprising converting a first profile of a
plurality of measurements of cellular constituents in said biological sample into a projected
profile containing a plurality of cellular constituent set values according to a definition of
co-varying basis cellular constituent sets, each of said cellular constituent set values being a
weighted average of the level of said cellular constituent within a corresponding cellular
constituent set, wherein said converting comprises projecting said first profile onto said basis
cellular constituent sets.

35. A method for analyzing a biological sample comprising converting a first profile of a
plurality of measurements of cellular constituents in a biological sample into a projected
profile containing a plurality of cellular constituent set values according to a definition of
co-varying basis cellular constituent sets, said plurality of measurements being normalized to a
unity vector size, wherein said converting comprises projecting said first profile onto said
basis cellular constituent sets.

36. A method of grouping biological response profiles according to the similarity of the
responses, said method comprising defining similar response profile sets based upon the
similarity of a plurality of measured cellular constituents in said response profiles.

37. The method of claim 36, further comprising the step of forming a clustering tree
derived by a cluster analysis of similarity of the plurality of measured cellular constituents in
said response profiles.

38. The method of claim 37, wherein groups of said biological response profiles are
defined as branches of said clustering tree.

39. The method of claim 36, further comprising determining a statistical significance of
the groups of biological response profiles.

40. The method of claim 39, wherein the statistical significance of the groups of
biological response profiles is determined by means of an objective statistical test.

41. The method of claim 40, wherein the objective statistical test comprises:

(a) determining an actual fractional improvement in cluster analysis of the
biological response profiles;

(b) generating permuted response profiles by means of Monte Carlo randomization
of each cellular constituent for each response profile;

(c) performing cluster analysis on the permuted response profiles;

(d) determining the fractional improvement in the cluster analysis of the permuted
response profiles; and

(e) repeating said steps of generating permuted response profiles and performing
cluster analysis on the permuted response profiles so that a distribution of
fractional improvements is obtained,

wherein the statistical significance is determined by comparing the actual fractional
improvement to the distribution of fractional improvements.

42. A method for analyzing a biological sample comprising:

(a) grouping cellular constituents from the biological sample into sets of cellular
constituents that co-vary in biological profiles obtained from the biological
sample; and

(b) grouping the biological profiles obtained from the biological sample into sets
of biological profiles that effect similar cellular constituents.

43. The method of claim 42, wherein one or more cellular constituents associated with a 
particular biological effect are identified from said sets of cellular constituents. 

44. The method of claim 42, wherein one or more biological profiles associated with a
particular biological effect are identified from said sets of biological profiles.

45. The method of claim 43 or 44, wherein the particular biological effect is a biological
pathway.

46. The method of claim 43 or 44, wherein the particular biological effect is a disease or
disease state.

47. The method of claim 43 or 44, wherein the particular biological effect is the effect of
treatment with one or more drugs.

48. The method of claim 43, wherein the cellular constituents from the biological sample
comprise a plurality of genes, and one or more genes associated with a particular biological
effect are identified.

49. The method of claim 46, wherein the one or more genes identified comprise known
genes.

50. The method of claim 46, wherein the one or more genes identified comprise
previously unknown genes.

51. The method of claim 42, wherein one or more perturbations associated with a
particular biological effect are identified from said sets of biological profiles.

52. The method of claim 49, wherein the one or more perturbations comprise a drug or a
drug candidate.

53. The method of claim 50, wherein the one or more perturbations comprise a genetic
mutation.

54. The method of claim 50 wherein the drug or drug candidate is a known drug or drug
candidate.

55. The method of claim 51, wherein the genetic mutation is a known genetic mutation.

56. The method of claim 50, wherein the drug or drug candidate is a previously unknown
drug or drug candidate.

57. The method of claim 51, wherein the genetic mutation is a previously unknown
genetic mutation.

58. A method for analyzing an N-dimensional array of data, N being a positive integer,
wherein each element of the N-dimensional array of data has N indices, said method
comprising grouping each index into sets of data that co-vary within the N-dimensional array
of data.

59. The method of claim 56, wherein each of said sets is defined by a similarity tree
derived by a cluster analysis of each of said indices.

60. A method for removing one or more artifacts from a measured biological profile
comprising a plurality of measurements of cellular constituents, said method comprising
subtracting one or more artifact patterns from the measured biological profile, wherein each
of said one or more artifact patterns corresponds to a particular artifact.

61. The method of claim 58, wherein each of the one or more artifact patterns is
provided by knowledge of the genes and relative amplitudes of responses associated with
the particular artifact to which each of the one or more artifact patterns corresponds.

62. The method of claim 58, wherein each of the one or more artifact patterns is provided 
by experiments with perturbations of suspected causative variables of the particular artifact to 
which each of the one or more artifact patterns corresponds. 

63. The method of claim 58, wherein each of the one or more artifact patterns is provided
by a cluster analysis of control biological profiles, the control biological profiles comprising a
plurality of measurements of cellular constituents in experiments wherein the artifact to
which each of the one or more artifact patterns corresponds arises.

64. The method of claim 58, wherein the one or more artifact patterns are scaled by
scaling coefficients, each of the one or more artifact patterns having a particular scaling
coefficient.

65. The method of claim 62, wherein the scaling coefficients are determined by a method
comprising determining the value of each particular scaling coefficient which minimizes the
value of an objective function of the difference between the measured profile and the sum of
the one or more scaled artifact patterns.

66. The method of claim 63, wherein the objective function is a least squares
minimization.
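
The least-squares determination of scaling coefficients, followed by subtraction of the scaled artifact patterns, can be sketched as below. This is an illustration only: it solves the normal equations with a small Gaussian elimination routine, assumes the artifact patterns are linearly independent, and uses hypothetical function names and data.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_scaling(measured, patterns):
    """Coefficients minimizing the squared difference between the measured
    profile and the sum of scaled artifact patterns (normal equations)."""
    gram = [[sum(a * b for a, b in zip(pi, pj)) for pj in patterns] for pi in patterns]
    rhs = [sum(a * m for a, m in zip(pi, measured)) for pi in patterns]
    return solve(gram, rhs)

def remove_artifacts(measured, patterns):
    """Subtract the fitted, scaled artifact patterns from the measured profile."""
    coeffs = fit_scaling(measured, patterns)
    cleaned = [
        m - sum(c * p[i] for c, p in zip(coeffs, patterns))
        for i, m in enumerate(measured)
    ]
    return coeffs, cleaned

# Hypothetical artifact patterns and measured profile over four constituents.
patterns = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
measured = [2.0, 0.5, 2.0, 0.3]
coeffs, cleaned = remove_artifacts(measured, patterns)
```

In this made-up example only the fourth constituent carries signal that the artifact patterns cannot explain, so it is the only nonzero entry left after subtraction.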

67. The method of claim 58, wherein each of the one or more artifact patterns is selected
from a library of artifact signatures, said artifact signatures corresponding to levels of severity
of each of the one or more artifacts.

68. The method of claim 65, wherein the artifact signatures are selected by a method
comprising determining the artifact signatures which minimize the values of an objective
function of the difference between the measured profile and the sum of the one or more
artifact signatures.
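
Selection from a library of artifact signatures can be sketched as a search over severity levels. The exhaustive search, the squared-error objective, the library layout, and the artifact names "drift" and "bleed" are all assumptions for illustration.

```python
from itertools import product

def select_signatures(measured, library):
    """Pick one severity level per artifact so that the summed signatures
    are closest (in squared error) to the measured profile.

    library: dict mapping artifact name to a dict of severity -> signature.
    """
    names = list(library)
    best_cost, best_choice = None, None
    for levels in product(*(library[name] for name in names)):
        total = [
            sum(library[name][level][i] for name, level in zip(names, levels))
            for i in range(len(measured))
        ]
        cost = sum((m - t) ** 2 for m, t in zip(measured, total))
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, dict(zip(names, levels))
    return best_choice

# Hypothetical library with two artifacts at two severity levels each.
library = {
    "drift": {"mild": [0.1, 0.1, 0.0], "severe": [0.5, 0.5, 0.0]},
    "bleed": {"mild": [0.0, 0.0, 0.2], "severe": [0.0, 0.0, 0.8]},
}
measured = [0.45, 0.55, 0.25]
choice = select_signatures(measured, library)
```

For this made-up profile the search picks the severe "drift" signature and the mild "bleed" signature. An exhaustive search grows quickly with the number of artifacts and levels, so a larger library would likely need a different strategy.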


69. The method of claim 1, wherein the plurality of different perturbations comprises a
plurality of graded levels of exposure to a particular perturbation.

70. The method of claim 67, wherein the particular perturbation is a drug or drug
candidate.

71. The method of claim 1, wherein said definition is based upon the co-variation of the
cellular constituents over a period of time.



72. An array of polynucleotide probes, said array comprising a support with at least one
surface and a plurality of different polynucleotide probes, wherein each different
polynucleotide probe:

(a) is attached to the surface of the support at a different location on said surface;

(b) comprises a different nucleotide sequence; and

(c) hybridizes to an expression product of a particular gene within a single geneset
of a plurality of genesets, in which

(i) said plurality of genesets is provided by a method comprising
grouping genes from a biological sample into sets of genes that
co-vary in biological profiles obtained from the biological sample, and

(ii) the number of different polynucleotide probes for each geneset that
hybridize to an expression product of a different particular gene
within said geneset is less than the total number of genes in the
geneset.

73. The array of claim 72 wherein the plurality of different polynucleotide probes
hybridizes to expression products of genes from between 50 and 1,000 different genesets.

74. The array of claim 73 wherein the plurality of different polynucleotide probes
hybridizes to expression products of genes from between 100 to 500 different genesets.

75. The array of claim 74 wherein the plurality of different polynucleotide probes
hybridizes to expression products of genes from between 100 to 200 different genesets.

76. The array of claim 72 wherein each particular gene is selected from a different
geneset.

77. The array of claim 72 wherein the plurality of different polynucleotide probes
hybridizes to expression products of no more than 10 particular genes from any one geneset.